Re: frequent terms - Re: combining open office spellchecker with Lucene
Doug Cutting wrote:

David Spencer wrote:

Doug Cutting wrote: And one should not try correction at all for terms which occur in a large proportion of the collection.

I keep thinking over this one and I don't understand it. If a user misspells a word and the "did you mean" spelling correction algorithm determines that a frequent term is a good suggestion, why not suggest it? The very fact that it's common could mean that it's more likely that the user wanted this word (well, the heuristic here is that users frequently search for frequent terms, which is probably wrong, but anyway..).

I think you misunderstood me. What I meant to say was that if the term the user enters is very common then spell correction may be skipped. Very common words which are similar to the term the user entered should of course be shown. But if the user's term is very common one need not even attempt to find similarly-spelled words. Is that any better?

Doug

Yes, sure, thx, I understand now - but maybe not - the context I had in mind was something like this:

[1] The user enters a query like: recursize descent parser

[2] The search code parses this and sees that the 1st word is not a term in the index, but the next 2 are. So it ignores the last 2 terms ("descent" and "parser") and suggests alternatives to "recursize"... thus if any term is in the index, regardless of frequency, it is left as-is.

I guess you're saying that, if the user enters a term that appears in the index and thus is sort of spelled correctly (as it exists in some doc), then we use the heuristic that any sufficiently large doc collection will have tons of misspellings, so we assume that rare terms in the query might be misspelled (i.e. not what the user intended) and we suggest alternatives to these words too (in addition to the words in the query that are not in the index at all).
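To make the per-term decision being discussed concrete, here is a minimal sketch (not from the thread): terms missing from the index always get suggestions, terms that are present but rare may themselves be collection misspellings and also get suggestions, and common terms are left alone. The class, field name and 0.1% threshold are illustrative assumptions only.

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

// Illustrative only: decide per query term whether "did you mean" suggestions
// should even be attempted, based on how often the term occurs in the index.
public class QueryTermChecker {

    private final IndexReader reader;
    private final String field;

    public QueryTermChecker(IndexReader reader, String field) {
        this.reader = reader;
        this.field = field;
    }

    public boolean shouldSuggestFor(String word) throws IOException {
        int df = reader.docFreq(new Term(field, word));
        if (df == 0) {
            return true;                                      // not in the index at all: definitely suggest
        }
        // present but rare: may itself be one of the collection's misspellings
        return ((double) df / reader.numDocs()) < 0.001;      // "rare" threshold is arbitrary
    }
}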
Re: frequent terms - Re: combining open office spellchecker with Lucene
David Spencer wrote: Doug Cutting wrote: And one should not try correction at all for terms which occur in a large proportion of the collection. I keep thinking over this one and I don't understand it. If a user misspells a word and the "did you mean" spelling correction algorithm determines that a frequent term is a good suggestion, why not suggest it? The very fact that it's common could mean that it's more likely that the user wanted this word (well, the heuristic here is that users frequently search for frequent terms, which is probably wrong, but anyway..).

I think you misunderstood me. What I meant to say was that if the term the user enters is very common then spell correction may be skipped. Very common words which are similar to the term the user entered should of course be shown. But if the user's term is very common one need not even attempt to find similarly-spelled words. Is that any better?

Doug
frequent terms - Re: combining open office spellchecker with Lucene
Doug Cutting wrote:

Aad Nales wrote: Before I start reinventing wheels I would like to do a short check to see if anybody else has already tried this. A customer has requested us to look into the possibility to perform a spell check on queries. So far the most promising way of doing this seems to be to create an Analyzer based on the spellchecker of OpenOffice. My question is: "has anybody tried this before?"

Note that a spell checker used with a search engine should use collection frequency information. That's to say, only "corrections" which are more frequent in the collection than what the user entered should be displayed. Frequency information can also be used when constructing the checker. For example, one need never consider proposing terms that occur in very few documents. And one should not try correction at all for terms which occur in a large proportion of the collection.

Doug

I keep thinking over this one and I don't understand it. If a user misspells a word and the "did you mean" spelling correction algorithm determines that a frequent term is a good suggestion, why not suggest it? The very fact that it's common could mean that it's more likely that the user wanted this word (well, the heuristic here is that users frequently search for frequent terms, which is probably wrong, but anyway..). I know in other contexts of IR frequent terms are penalized but in this context it seems that frequent terms should be fine...

-- Dave
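For readers who want to see the three frequency heuristics above in code, here is a rough sketch (mine, not from the list). It assumes some candidate generator (n-grams, edit distance, whatever) has already produced raw suggestions, and only filters them against collection frequency; the thresholds and the field name are arbitrary placeholders.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

// Illustration of the frequency heuristics described above, not a real spellchecker.
public class FrequencyFilter {

    private final IndexReader reader;
    private final String field;

    public FrequencyFilter(IndexReader reader, String field) {
        this.reader = reader;
        this.field = field;
    }

    public List filter(String userWord, List candidates) throws IOException {
        int userFreq = reader.docFreq(new Term(field, userWord));
        int numDocs = reader.numDocs();

        // Heuristic 3: the user's term is already very common -- don't correct at all.
        if (numDocs > 0 && (double) userFreq / numDocs > 0.10) {   // threshold is arbitrary
            return new ArrayList();
        }

        List kept = new ArrayList();
        for (Iterator it = candidates.iterator(); it.hasNext();) {
            String cand = (String) it.next();
            int candFreq = reader.docFreq(new Term(field, cand));
            // Heuristic 1: only show corrections more frequent than what the user typed.
            // Heuristic 2: never propose terms that occur in very few documents.
            if (candFreq > userFreq && candFreq >= 5) {            // "5" is arbitrary
                kept.add(cand);
            }
        }
        return kept;
    }
}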
IRC?!
There isn't a Lucene IRC room is there (at least there isn't according to Google)? I just joined #lucene on irc.freenode.net if anyone is interested... Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: question on Hits.doc
Hello Roy,

This sounds normal. When you pull a Document from Hits, you are really pulling it from the disk. All fields are read from disk at that time (i.e. no lazy loading of fields), so if you have large text fields, this is going to result in a lot of disk IO. You could try running vmstat or sar (I'm assuming you are using a UNIX flavour) and look at the bi/bo columns (really just bi -- blocks in, i.e. data read from disk). There is not much you can do. If you don't have to store the field, not storing it will probably help. Some people are working on adding support for field compression, so maybe that will help.

Otis

--- [EMAIL PROTECTED] wrote:
> Hey guys,
>
> We were noticing some speed problems on our searches and after adding some
> debug statements to the lucene source code, we have determined that the
> Hits.doc(x) is the problem. (BTW, we are using Lucene 1.2 [with plans to
> upgrade]). It seems that retrieving the actual Document from the search is
> very slow.
>
> We think it might be our "Message" field which stores a huge amount of text.
> We are currently running a test in which we won't "store" the "Message" field,
> however, I was wondering if any of you guys would know if that would be the
> reason why we're having the performance problems? If so, could anyone also
> please explain it? It seemed that we weren't having these performance
> problems before. Has anyone else experienced this? Our environment is NT 4,
> JDK 1.4.2, and PIIIs.
>
> I know that for large text fields, storing the field is not a good practice,
> however, it held certain conveniences for us that I hope to not get rid of.
>
> Roy.
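Since the question keeps coming up: here is a small sketch of what "not storing the Message field" looks like with the Lucene 1.x field API. The field names, index path and id value are invented for the example.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// Sketch: index the large "Message" body as indexed-but-not-stored (Field.UnStored)
// so that Hits.doc() no longer has to drag the full text back off disk.
public class MessageIndexer {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/tmp/mail-index", new StandardAnalyzer(), true);

        Document doc = new Document();
        doc.add(Field.Keyword("id", "12345"));                   // stored, used to fetch the original later
        doc.add(Field.Text("subject", "question on Hits.doc"));  // stored and indexed
        doc.add(Field.UnStored("Message", "...huge body..."));   // indexed only, never stored

        writer.addDocument(doc);
        writer.close();
    }
}

The trade-off is that the body can no longer be displayed straight from the index; it has to be fetched from the original store (database, file system, ...) via the stored id.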
question on Hits.doc
Hey guys, We were noticing some speed problems on our searches and after adding some debug statements to the lucene source code, we have determined that the Hits.doc(x) is the problem. (BTW, we are using Lucene 1.2 [with plans to upgrade]). It seems that retrieving the actual Document from the search is very slow. We think it might be our "Message" field which stores a huge amount of text. We are currently running a test in which we won't "store" the "Message" field, however, I was wondering if any of you guys would know if that would be the reason why we're having the performance problems? If so, could anyone also please explain it? It seemed that we weren't having these performance problems before. Has anyone else experienced this? Our environment is NT 4, JDK 1.4.2, and PIIIs. I know that for large text fields, storing the field is not a good practice, however, it held certain conveniences for us that I hope to not get rid of. Roy. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents
Daniel Taurat wrote: Hi Pete, good hint, but we actually do have physical memory of 4Gb on the system. But then: we also have experienced that the gc of ibm jdk1.3.1 that we use is sometimes behaving strangely with too large heap space anyway. (Limit seems to be 1.2 Gb) Depends on what OS and with what patches... Linux on i386 seems to have a physical limit of 1.7G (256M for VM) ... There are some patches to apply to get 3G but only on really modern kernels. I just need to get Athlon systems :-/ Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
TermInfo using 300M for large index?
I'm trying to do some heap debugging of my application to find a memory leak. Noticed that org.apache.lucene.index.TermInfo had 1.7M instances which consumed 300M ... this is of course for a 40G index. Is this normal and is there any way I can streamline this? We are of course caching the IndexSearchers but I want to reduce the memory footprint... Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: MultiFieldQueryParser seems broken... Fix attached.
Daniel Naber wrote: On Thursday 09 September 2004 18:52, Doug Cutting wrote: I have not been able to construct a two-word query that returns a page without both words in either the content, the title, the url or in a single anchor. Can you? Like this one? konvens leitseite Leitseite is only in the title of the first match (www.gldv.org), konvens is only in the body. Good job finding that! I guess I should fix Nutch's BasicQueryFilter. Thanks, Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents
It sounds like the ThreadLocal in TermInfosReader is not getting correctly garbage collected when the TermInfosReader is collected. Researching a bit, this was a bug in JVMs prior to 1.4.2, so my guess is that you're running in an older JVM. Is that right? I've attached a patch which should fix this. Please tell me if it works for you. Doug Daniel Taurat wrote: Okay, that (1.4rc3)worked fine, too! Got only 257 SegmentTermEnums for 1900 objects. Now I will go for the final test on the production server with the 1.4rc3 version and about 40.000 objects. Daniel Daniel Taurat schrieb: Hi all, here is some update for you: I switched back to Lucene 1.3-final and now the number of the SegmentTermEnum objects is controlled by gc again: it goes up to about 1000 and then it is down again to 254 after indexing my 1900 test-objects. Stay tuned, I will try 1.4RC3 now, the last version before FieldCache was introduced... Daniel Rupinder Singh Mazara schrieb: hi all I had a similar problem, i have database of documents with 24 fields, and a average content of 7K, with 16M+ records i had to split the jobs into slabs of 1M each and merging the resulting indexes, submissions to our job queue looked like java -Xms100M -Xcompactexplicitgc -cp $CLASSPATH lucene.Indexer 22 and i still had outofmemory exception , the solution that i created was to after every 200K, documents create a temp directory, and merge them together, this was done to do the first production run, updates are now being handled incrementally Exception in thread "main" java.lang.OutOfMemoryError at org.apache.lucene.store.RAMOutputStream.flushBuffer(RAMOutputStream.java(Compiled Code)) at org.apache.lucene.store.OutputStream.flush(OutputStream.java(Inlined Compiled Code)) at org.apache.lucene.store.OutputStream.writeByte(OutputStream.java(Inlined Compiled Code)) at org.apache.lucene.store.OutputStream.writeBytes(OutputStream.java(Compiled Code)) at org.apache.lucene.index.CompoundFileWriter.copyFile(CompoundFileWriter.java(Compiled Code)) at org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java(Compiled Code)) at org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java(Compiled Code)) at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java(Compiled Code)) at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java(Compiled Code)) at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366) at lucene.Indexer.doIndex(CDBIndexer.java(Compiled Code)) at lucene.Indexer.main(CDBIndexer.java:168) -Original Message- From: Daniel Taurat [mailto:[EMAIL PROTECTED] Sent: 10 September 2004 14:42 To: Lucene Users List Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents Hi Pete, good hint, but we actually do have physical memory of 4Gb on the system. But then: we also have experienced that the gc of ibm jdk1.3.1 that we use is sometimes behaving strangely with too large heap space anyway. (Limit seems to be 1.2 Gb) I can say that gc is not collecting these objects since I forced gc runs when indexing every now and then (when parsing pdf-type objects, that is): No effect. regards, Daniel Pete Lewis wrote: Hi all Reading the thread with interest, there is another way I've come across out of memory errors when indexing large batches of documents. If you have your heap space settings too high, then you get swapping (which impacts performance) plus you never reach the trigger for garbage collection, hence you don't garbage collect and hence you run out of memory. 
Can you check whether or not your garbage collection is being triggered? Anomalously therefore if this is the case, by reducing the heap space you can improve performance get rid of the out of memory errors. Cheers Pete Lewis - Original Message - From: "Daniel Taurat" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Friday, September 10, 2004 1:10 PM Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents Daniel Aber schrieb: On Thursday 09 September 2004 19:47, Daniel Taurat wrote: I am facing an out of memory problem using Lucene 1.4.1. Could you try with a recent CVS version? There has been a fix about files not being deleted after 1.4.1. Not sure if that could cause the problems you're experiencing. Regards Daniel Well, it seems not to be files, it looks more like those SegmentTermEnum objects accumulating in memory. #I've seen some discussion on these objects in the developer-newsgroup that had taken place some time ago. I am afraid this is some kind of runaway caching I have to deal with. Maybe not correctly addressed in this newsgroup, after all... Anyway: any idea if there is an API command to re-init caches? Thanks,
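Doug's actual patch is attached to the original mail and is not reproduced here. Purely to illustrate the kind of leak being described, here is a stand-alone sketch (not Lucene's real TermInfosReader, and not the patch) of a per-thread object cached in a ThreadLocal, together with a close() that explicitly drops the reference so that older (pre-1.4.2) JVMs do not keep it reachable.

// Illustration only: a per-thread cache held in a ThreadLocal can pin large
// objects in memory on older JVMs unless the reference is cleared explicitly
// when the owning object is closed.
public class PerThreadEnumHolder {

    private final ThreadLocal enumerators = new ThreadLocal();

    Object getEnum() {
        Object termEnum = enumerators.get();
        if (termEnum == null) {
            termEnum = createEnum();          // expensive per-thread state
            enumerators.set(termEnum);
        }
        return termEnum;
    }

    private Object createEnum() {
        return new byte[1024 * 1024];         // stand-in for something like a SegmentTermEnum
    }

    public void close() {
        enumerators.set(null);                // drop the per-thread reference explicitly
    }
}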
Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents
Okay, that (1.4rc3)worked fine, too! Got only 257 SegmentTermEnums for 1900 objects. Now I will go for the final test on the production server with the 1.4rc3 version and about 40.000 objects. Daniel Daniel Taurat schrieb: Hi all, here is some update for you: I switched back to Lucene 1.3-final and now the number of the SegmentTermEnum objects is controlled by gc again: it goes up to about 1000 and then it is down again to 254 after indexing my 1900 test-objects. Stay tuned, I will try 1.4RC3 now, the last version before FieldCache was introduced... Daniel Rupinder Singh Mazara schrieb: hi all I had a similar problem, i have database of documents with 24 fields, and a average content of 7K, with 16M+ records i had to split the jobs into slabs of 1M each and merging the resulting indexes, submissions to our job queue looked like java -Xms100M -Xcompactexplicitgc -cp $CLASSPATH lucene.Indexer 22 and i still had outofmemory exception , the solution that i created was to after every 200K, documents create a temp directory, and merge them together, this was done to do the first production run, updates are now being handled incrementally Exception in thread "main" java.lang.OutOfMemoryError at org.apache.lucene.store.RAMOutputStream.flushBuffer(RAMOutputStream.java(Compiled Code)) at org.apache.lucene.store.OutputStream.flush(OutputStream.java(Inlined Compiled Code)) at org.apache.lucene.store.OutputStream.writeByte(OutputStream.java(Inlined Compiled Code)) at org.apache.lucene.store.OutputStream.writeBytes(OutputStream.java(Compiled Code)) at org.apache.lucene.index.CompoundFileWriter.copyFile(CompoundFileWriter.java(Compiled Code)) at org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java(Compiled Code)) at org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java(Compiled Code)) at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java(Compiled Code)) at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java(Compiled Code)) at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366) at lucene.Indexer.doIndex(CDBIndexer.java(Compiled Code)) at lucene.Indexer.main(CDBIndexer.java:168) -Original Message- From: Daniel Taurat [mailto:[EMAIL PROTECTED] Sent: 10 September 2004 14:42 To: Lucene Users List Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents Hi Pete, good hint, but we actually do have physical memory of 4Gb on the system. But then: we also have experienced that the gc of ibm jdk1.3.1 that we use is sometimes behaving strangely with too large heap space anyway. (Limit seems to be 1.2 Gb) I can say that gc is not collecting these objects since I forced gc runs when indexing every now and then (when parsing pdf-type objects, that is): No effect. regards, Daniel Pete Lewis wrote: Hi all Reading the thread with interest, there is another way I've come across out of memory errors when indexing large batches of documents. If you have your heap space settings too high, then you get swapping (which impacts performance) plus you never reach the trigger for garbage collection, hence you don't garbage collect and hence you run out of memory. Can you check whether or not your garbage collection is being triggered? Anomalously therefore if this is the case, by reducing the heap space you can improve performance get rid of the out of memory errors. 
Cheers Pete Lewis - Original Message - From: "Daniel Taurat" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Friday, September 10, 2004 1:10 PM Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents Daniel Aber schrieb: On Thursday 09 September 2004 19:47, Daniel Taurat wrote: I am facing an out of memory problem using Lucene 1.4.1. Could you try with a recent CVS version? There has been a fix about files not being deleted after 1.4.1. Not sure if that could cause the problems you're experiencing. Regards Daniel Well, it seems not to be files, it looks more like those SegmentTermEnum objects accumulating in memory. #I've seen some discussion on these objects in the developer-newsgroup that had taken place some time ago. I am afraid this is some kind of runaway caching I have to deal with. Maybe not correctly addressed in this newsgroup, after all... Anyway: any idea if there is an API command to re-init caches? Thanks, Daniel - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] ---
Re: combining open office spellchecker with Lucene
eks dev wrote:

> Hi Doug,
>
> > Perhaps. Are folks really better at spelling the beginning of words?
>
> Yes they are. There were some comprehensive empirical studies on this topic. The Winkler modification of the Jaro string distance is based on this assumption (boosting similarity if the first n, I think 4, chars match). Jaro-Winkler is well documented and some folks think that it is much more efficient and precise than plain edit distance (of course for normal language, not numbers or so). I will try to dig out some references from my disk on

Good ole Citeseer finds 2 docs that seem relevant: http://citeseer.ist.psu.edu/cs?cs=1&q=Winkler+Jaro&submit=Documents&co=Citations&cm=50&cf=Any&ao=Citations&am=20&af=Any

I have some of the ngram spelling suggestion stuff, based on earlier msgs in this thread, working in my dev tree. I'll try to get a test site up later today for people to fool around with.

> this topic, if you are interested.
>
> On another note, I would even suggest using Jaro-Winkler distance as the default for fuzzy query (one could configure a max prefix required => prefix query to reduce the number of distance calculations). This could speed up fuzzy search dramatically.
>
> Hope this was helpful,
> Eks
Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents
Hi all, here is some update for you: I switched back to Lucene 1.3-final and now the number of the SegmentTermEnum objects is controlled by gc again: it goes up to about 1000 and then it is down again to 254 after indexing my 1900 test-objects. Stay tuned, I will try 1.4RC3 now, the last version before FieldCache was introduced... Daniel Rupinder Singh Mazara schrieb: hi all I had a similar problem, i have database of documents with 24 fields, and a average content of 7K, with 16M+ records i had to split the jobs into slabs of 1M each and merging the resulting indexes, submissions to our job queue looked like java -Xms100M -Xcompactexplicitgc -cp $CLASSPATH lucene.Indexer 22 and i still had outofmemory exception , the solution that i created was to after every 200K, documents create a temp directory, and merge them together, this was done to do the first production run, updates are now being handled incrementally Exception in thread "main" java.lang.OutOfMemoryError at org.apache.lucene.store.RAMOutputStream.flushBuffer(RAMOutputStream.java(Compiled Code)) at org.apache.lucene.store.OutputStream.flush(OutputStream.java(Inlined Compiled Code)) at org.apache.lucene.store.OutputStream.writeByte(OutputStream.java(Inlined Compiled Code)) at org.apache.lucene.store.OutputStream.writeBytes(OutputStream.java(Compiled Code)) at org.apache.lucene.index.CompoundFileWriter.copyFile(CompoundFileWriter.java(Compiled Code)) at org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java(Compiled Code)) at org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java(Compiled Code)) at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java(Compiled Code)) at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java(Compiled Code)) at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366) at lucene.Indexer.doIndex(CDBIndexer.java(Compiled Code)) at lucene.Indexer.main(CDBIndexer.java:168) -Original Message- From: Daniel Taurat [mailto:[EMAIL PROTECTED] Sent: 10 September 2004 14:42 To: Lucene Users List Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents Hi Pete, good hint, but we actually do have physical memory of 4Gb on the system. But then: we also have experienced that the gc of ibm jdk1.3.1 that we use is sometimes behaving strangely with too large heap space anyway. (Limit seems to be 1.2 Gb) I can say that gc is not collecting these objects since I forced gc runs when indexing every now and then (when parsing pdf-type objects, that is): No effect. regards, Daniel Pete Lewis wrote: Hi all Reading the thread with interest, there is another way I've come across out of memory errors when indexing large batches of documents. If you have your heap space settings too high, then you get swapping (which impacts performance) plus you never reach the trigger for garbage collection, hence you don't garbage collect and hence you run out of memory. Can you check whether or not your garbage collection is being triggered? Anomalously therefore if this is the case, by reducing the heap space you can improve performance get rid of the out of memory errors. 
Cheers Pete Lewis - Original Message - From: "Daniel Taurat" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Friday, September 10, 2004 1:10 PM Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents Daniel Aber schrieb: On Thursday 09 September 2004 19:47, Daniel Taurat wrote: I am facing an out of memory problem using Lucene 1.4.1. Could you try with a recent CVS version? There has been a fix about files not being deleted after 1.4.1. Not sure if that could cause the problems you're experiencing. Regards Daniel Well, it seems not to be files, it looks more like those SegmentTermEnum objects accumulating in memory. #I've seen some discussion on these objects in the developer-newsgroup that had taken place some time ago. I am afraid this is some kind of runaway caching I have to deal with. Maybe not correctly addressed in this newsgroup, after all... Anyway: any idea if there is an API command to re-init caches? Thanks, Daniel - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] ---
Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents
The parser is pdfBox. PDF is about 25% of the overall indexing volume on the productive system. I also have word-docs and loads of html resources to be indexed. In my testing environment I merely have 5 pdf docs and still those permanent objects hanging around, though.

Cheers, Daniel

Ben Litchfield wrote:

> I can say that gc is not collecting these objects since I forced gc runs when indexing every now and then (when parsing pdf-type objects, that is): No effect.

What PDF parser are you using? Is the problem within the parser and not lucene? Are you releasing all resources?

Ben
RE: Out of memory in lucene 1.4.1 when re-indexing large number of documents
Hi all,

I had a similar problem: I have a database of documents with 24 fields and an average content of 7K, with 16M+ records. I had to split the job into slabs of 1M documents each and merge the resulting indexes. Submissions to our job queue looked like

java -Xms100M -Xcompactexplicitgc -cp $CLASSPATH lucene.Indexer 22

and I still had an OutOfMemory exception. The solution I came up with was to create a temp directory after every 200K documents and merge the partial indexes together; this was done for the first production run, and updates are now being handled incrementally.

Exception in thread "main" java.lang.OutOfMemoryError
 at org.apache.lucene.store.RAMOutputStream.flushBuffer(RAMOutputStream.java(Compiled Code))
 at org.apache.lucene.store.OutputStream.flush(OutputStream.java(Inlined Compiled Code))
 at org.apache.lucene.store.OutputStream.writeByte(OutputStream.java(Inlined Compiled Code))
 at org.apache.lucene.store.OutputStream.writeBytes(OutputStream.java(Compiled Code))
 at org.apache.lucene.index.CompoundFileWriter.copyFile(CompoundFileWriter.java(Compiled Code))
 at org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java(Compiled Code))
 at org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java(Compiled Code))
 at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java(Compiled Code))
 at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java(Compiled Code))
 at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366)
 at lucene.Indexer.doIndex(CDBIndexer.java(Compiled Code))
 at lucene.Indexer.main(CDBIndexer.java:168)

>-Original Message- >From: Daniel Taurat [mailto:[EMAIL PROTECTED] >Sent: 10 September 2004 14:42 >To: Lucene Users List >Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number >of documents > > >Hi Pete, >good hint, but we actually do have physical memory of 4Gb on the >system. But then: we also have experienced that the gc of ibm jdk1.3.1 >that we use is sometimes >behaving strangely with too large heap space anyway. (Limit seems to be >1.2 Gb) >I can say that gc is not collecting these objects since I forced gc >runs when indexing every now and then (when parsing pdf-type objects, >that is): No effect. > >regards, > >Daniel > > >Pete Lewis wrote: > >>Hi all >> >>Reading the thread with interest, there is another way I've come >across out >>of memory errors when indexing large batches of documents. >> >>If you have your heap space settings too high, then you get >swapping (which >>impacts performance) plus you never reach the trigger for garbage >>collection, hence you don't garbage collect and hence you run out >of memory. >> >>Can you check whether or not your garbage collection is being triggered? >> >>Anomalously therefore if this is the case, by reducing the heap space you >>can improve performance get rid of the out of memory errors. >> >>Cheers >>Pete Lewis >> >>- Original Message - >>From: "Daniel Taurat" <[EMAIL PROTECTED]> >>To: "Lucene Users List" <[EMAIL PROTECTED]> >>Sent: Friday, September 10, 2004 1:10 PM >>Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large >number of >>documents >> >> >> >> >>>Daniel Aber schrieb: >>> >>> >>> On Thursday 09 September 2004 19:47, Daniel Taurat wrote: >I am facing an out of memory problem using Lucene 1.4.1. > > > > Could you try with a recent CVS version? There has been a fix >about files not being deleted after 1.4.1. Not sure if that could cause the problems you're experiencing.
Regards Daniel >>>Well, it seems not to be files, it looks more like those SegmentTermEnum >>>objects accumulating in memory. >>>#I've seen some discussion on these objects in the developer-newsgroup >>>that had taken place some time ago. >>>I am afraid this is some kind of runaway caching I have to deal with. >>>Maybe not correctly addressed in this newsgroup, after all... >>> >>>Anyway: any idea if there is an API command to re-init caches? >>> >>>Thanks, >>> >>>Daniel >>> >>> >>> >>>- >>>To unsubscribe, e-mail: [EMAIL PROTECTED] >>>For additional commands, e-mail: [EMAIL PROTECTED] >>> >>> >>> >> >> >>- >>To unsubscribe, e-mail: [EMAIL PROTECTED] >>For additional commands, e-mail: [EMAIL PROTECTED] >> >> >> > > > >- >To unsubscribe, e-mail: [EMAIL PROTECTED] >For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
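As a concrete illustration of the slab-and-merge approach Rupinder describes above, here is a sketch using the Lucene 1.x addIndexes() call. The paths, slab count and analyzer are assumptions made up for the example, not his actual job code.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Sketch of "index in slabs, then merge": each slab was built into its own
// temporary directory, and the partial indexes are merged into the final index
// with IndexWriter.addIndexes().
public class SlabMerger {
    public static void main(String[] args) throws Exception {
        int slabs = 16;                                       // e.g. 16 slabs of 1M docs each
        Directory[] partials = new Directory[slabs];
        for (int i = 0; i < slabs; i++) {
            partials[i] = FSDirectory.getDirectory("/tmp/slab-" + i, false);
        }

        IndexWriter writer = new IndexWriter("/data/final-index", new StandardAnalyzer(), true);
        writer.addIndexes(partials);                          // merges (and optimizes) the partial indexes
        writer.close();
    }
}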
Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents
> I can say that gc is not collecting these objects since I forced gc > runs when indexing every now and then (when parsing pdf-type objects, > that is): No effect. What PDF parser are you using? Is the problem within the parser and not lucene? Are you releasing all resources? Ben - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Addition to contributions page
I hope I'm posting this request to the right list. The page that lists 3rd-party tools that work with Lucene ( http://jakarta.apache.org/lucene/docs/contributions.html ) says that to be added to the page, one should send a message to one of the Lucene mailing lists. So, that's what I'm doing. PDFTextStream should be added to the 'Document Converters' section, with this URL < http://snowtide.com >, and perhaps this heading: 'PDFTextStream -- PDF text and metadata extraction'. The 'Author' field should probably be left blank, since there's no single creator. Thanks much, Chas Emerick | [EMAIL PROTECTED] PDFTextStream: fast PDF text extraction for Java apps and Lucene http://snowtide.com/home/PDFTextStream/
Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents
Hi Pete, good hint, but we actually do have physical memory of 4Gb on the system. But then: we also have experienced that the gc of ibm jdk1.3.1 that we use is sometimes behaving strangely with too large heap space anyway. (Limit seems to be 1.2 Gb) I can say that gc is not collecting these objects since I forced gc runs when indexing every now and then (when parsing pdf-type objects, that is): No effect. regards, Daniel Pete Lewis wrote: Hi all Reading the thread with interest, there is another way I've come across out of memory errors when indexing large batches of documents. If you have your heap space settings too high, then you get swapping (which impacts performance) plus you never reach the trigger for garbage collection, hence you don't garbage collect and hence you run out of memory. Can you check whether or not your garbage collection is being triggered? Anomalously therefore if this is the case, by reducing the heap space you can improve performance get rid of the out of memory errors. Cheers Pete Lewis - Original Message - From: "Daniel Taurat" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Friday, September 10, 2004 1:10 PM Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents Daniel Aber schrieb: On Thursday 09 September 2004 19:47, Daniel Taurat wrote: I am facing an out of memory problem using Lucene 1.4.1. Could you try with a recent CVS version? There has been a fix about files not being deleted after 1.4.1. Not sure if that could cause the problems you're experiencing. Regards Daniel Well, it seems not to be files, it looks more like those SegmentTermEnum objects accumulating in memory. #I've seen some discussion on these objects in the developer-newsgroup that had taken place some time ago. I am afraid this is some kind of runaway caching I have to deal with. Maybe not correctly addressed in this newsgroup, after all... Anyway: any idea if there is an API command to re-init caches? Thanks, Daniel - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents
Hi all Reading the thread with interest, there is another way I've come across out of memory errors when indexing large batches of documents. If you have your heap space settings too high, then you get swapping (which impacts performance) plus you never reach the trigger for garbage collection, hence you don't garbage collect and hence you run out of memory. Can you check whether or not your garbage collection is being triggered? Anomalously therefore if this is the case, by reducing the heap space you can improve performance get rid of the out of memory errors. Cheers Pete Lewis - Original Message - From: "Daniel Taurat" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Friday, September 10, 2004 1:10 PM Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents > Daniel Aber schrieb: > > >On Thursday 09 September 2004 19:47, Daniel Taurat wrote: > > > > > > > >>I am facing an out of memory problem using Lucene 1.4.1. > >> > >> > > > >Could you try with a recent CVS version? There has been a fix about files > >not being deleted after 1.4.1. Not sure if that could cause the problems > >you're experiencing. > > > >Regards > > Daniel > > > > > > > Well, it seems not to be files, it looks more like those SegmentTermEnum > objects accumulating in memory. > #I've seen some discussion on these objects in the developer-newsgroup > that had taken place some time ago. > I am afraid this is some kind of runaway caching I have to deal with. > Maybe not correctly addressed in this newsgroup, after all... > > Anyway: any idea if there is an API command to re-init caches? > > Thanks, > > Daniel > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents
Daniel Aber schrieb:

> On Thursday 09 September 2004 19:47, Daniel Taurat wrote:
>> I am facing an out of memory problem using Lucene 1.4.1.
>
> Could you try with a recent CVS version? There has been a fix about files not being deleted after 1.4.1. Not sure if that could cause the problems you're experiencing.
>
> Regards
> Daniel

Well, it seems not to be files, it looks more like those SegmentTermEnum objects accumulating in memory. I've seen some discussion on these objects in the developer-newsgroup that had taken place some time ago. I am afraid this is some kind of runaway caching I have to deal with. Maybe not correctly addressed in this newsgroup, after all...

Anyway: any idea if there is an API command to re-init caches?

Thanks,

Daniel
Re: combining open office spellchecker with Lucene
Hi Doug,

> Perhaps. Are folks really better at spelling the beginning of words?

Yes they are. There were some comprehensive empirical studies on this topic. The Winkler modification of the Jaro string distance is based on this assumption (boosting similarity if the first n, I think 4, chars match). Jaro-Winkler is well documented and some folks think that it is much more efficient and precise than plain edit distance (of course for normal language, not numbers or so). I will try to dig out some references from my disk on this topic, if you are interested.

On another note, I would even suggest using Jaro-Winkler distance as the default for fuzzy query (one could configure a max prefix required => prefix query to reduce the number of distance calculations). This could speed up fuzzy search dramatically.

Hope this was helpful,
Eks
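For anyone who wants to experiment with the suggestion, here is a plain-Java sketch of the standard Jaro-Winkler similarity (the usual 0.1 scaling factor and 4-character prefix cap). It is a textbook implementation written for this thread, not code from Lucene or from Eks.

// Textbook Jaro-Winkler similarity: plain Jaro plus a boost when the first
// few characters of the two strings agree.
public class JaroWinkler {

    /** Plain Jaro similarity in [0, 1]. */
    static double jaro(String s1, String s2) {
        int len1 = s1.length(), len2 = s2.length();
        if (len1 == 0 && len2 == 0) return 1.0;
        if (len1 == 0 || len2 == 0) return 0.0;

        int window = Math.max(0, Math.max(len1, len2) / 2 - 1);
        boolean[] matched1 = new boolean[len1];
        boolean[] matched2 = new boolean[len2];

        int matches = 0;
        for (int i = 0; i < len1; i++) {
            int start = Math.max(0, i - window);
            int end = Math.min(len2 - 1, i + window);
            for (int j = start; j <= end; j++) {
                if (!matched2[j] && s1.charAt(i) == s2.charAt(j)) {
                    matched1[i] = true;
                    matched2[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;

        // count out-of-order characters among the matches
        int transposed = 0, k = 0;
        for (int i = 0; i < len1; i++) {
            if (matched1[i]) {
                while (!matched2[k]) k++;
                if (s1.charAt(i) != s2.charAt(k)) transposed++;
                k++;
            }
        }
        double m = matches;
        return (m / len1 + m / len2 + (m - transposed / 2.0) / m) / 3.0;
    }

    /** Winkler modification: boost the score when the first (up to 4) characters agree. */
    static double jaroWinkler(String s1, String s2) {
        double j = jaro(s1, s2);
        int prefix = 0;
        int max = Math.min(4, Math.min(s1.length(), s2.length()));
        while (prefix < max && s1.charAt(prefix) == s2.charAt(prefix)) prefix++;
        return j + prefix * 0.1 * (1.0 - j);
    }

    public static void main(String[] args) {
        System.out.println(jaroWinkler("recursize", "recursive"));  // long shared prefix: strongly boosted
        System.out.println(jaroWinkler("descent", "descend"));
    }
}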
Re: MultiFieldQueryParser seems broken... Fix attached.
I reckon there has been a discussion (and solution :-) on how to achieve the functionality you've been after: http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]&msgId=1798116 I'm not sure if this would be the same though. Best regards, René

Hi all, I took the code indicated by Rene but I've seen that it's not completely fitting my requirements, because my application should provide the facility to treat queries as fuzzy queries. So I modified the code to the following, and added a test main method. Hope it helps someone.

package org.apache.lucene;

/* @(#) CWK 1.5 10.09.2004
 *
 * Copyright 2003-2005 ConfigWorks Informationssysteme & Consulting GmbH
 * Universitätsstr. 94/7 9020 Klagenfurt Austria
 * www.configworks.com
 * All rights reserved.
 */

import java.util.Vector;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.Query;

/**
 * @author sergiu
 *
 * This class is a patch for MultiFieldQueryParser.
 * Its behaviour can be tested by running the main method.
 *
 * Now:
 *   String[] fields = new String[] { "title", "abstract", "content" };
 *   QueryParser parser = new CustomQueryParser(fields, new SimpleAnalyzer());
 *   parser.setOperator(QueryParser.DEFAULT_OPERATOR_AND);
 *   Query query = parser.parse("foo -bar (baz OR title:bla)");
 *   System.out.println("? " + query);
 *
 * Produces:
 *   ? +(title:foo abstract:foo content:foo) -(title:bar abstract:bar content:bar)
 *     +((title:baz abstract:baz content:baz) title:bla)
 *
 * Perfect!
 *
 * @version 1.0
 * @since CWK 1.5
 */
public class CustomQueryParser extends QueryParser {

    private String[] fields;
    private boolean fuzzySearch = false;

    public CustomQueryParser(String[] fields, Analyzer analyzer) {
        super(null, analyzer);
        this.fields = fields;
    }

    public CustomQueryParser(String[] fields, Analyzer analyzer, int defaultOperator) {
        super(null, analyzer);
        this.fields = fields;
        setOperator(defaultOperator);
    }

    protected Query getFieldQuery(String field, Analyzer analyzer, String queryText) throws ParseException {
        Query query = null;
        if (field == null) {
            // no explicit field given: expand the clause over all configured fields
            Vector clauses = new Vector();
            for (int i = 0; i < fields.length; i++) {
                if (isFuzzySearch())
                    clauses.add(new BooleanClause(super.getFuzzyQuery(fields[i], queryText), false, false));
                else
                    clauses.add(new BooleanClause(super.getFieldQuery(fields[i], analyzer, queryText), false, false));
            }
            query = getBooleanQuery(clauses);
        } else {
            if (isFuzzySearch())
                query = super.getFuzzyQuery(field, queryText);
            else
                query = super.getFieldQuery(field, analyzer, queryText);
        }
        return query;
    }

    public boolean isFuzzySearch() {
        return fuzzySearch;
    }

    public void setFuzzySearch(boolean fuzzySearch) {
        this.fuzzySearch = fuzzySearch;
    }

    public static void main(String[] args) throws Exception {
        String[] fields = new String[] { "title", "abstract", "content" };
        CustomQueryParser parser = new CustomQueryParser(fields, new StandardAnalyzer());
        parser.setOperator(QueryParser.DEFAULT_OPERATOR_AND);
        parser.setFuzzySearch(true);
        String queryString = "foo -bar (baz OR title:bla)";
        System.out.println(queryString);
        Query query = parser.parse(queryString);
        System.out.println("? " + query);
    }
}