Questions about GermanAnalyzer/Stemmer
Hello,

We're using the GermanAnalyzer/Stemmer to index/search our (German) website. I have a few questions:

(1) Why is the GermanAnalyzer case-sensitive? None of the other language analyzers seem to be. What does this feature add?

(2) With the GermanAnalyzer, wildcard searches containing extended German characters do not seem to work. So, a* is fine but ä* or ö* always finds zero results.

(3) In a similar vein to (2), wildcard searches with escaped special characters fail to find results. So a search for co\-operative works, but a search for co\-op* fails.

I will be grateful for any light that can be shed on these problems.

With thanks,
Jon.

Jon Humble BSc (Hons), Software Engineer
eMail: [EMAIL PROTECTED]
TecSphere Ltd, Centre for Advanced Industry, Coble Dene, Royal Quays, Newcastle upon Tyne NE29 6DE, United Kingdom
Direct Dial: +44 (191) 270 31 06 Fax: +44 (191) 270 31 09
http://www.tecsphere.com
Is IndexSearcher thread safe?
Is it thread-safe to share one instance of IndexSearcher between multiple threads?
Re: Questions about GermanAnalyzer/Stemmer [auf Viren geprueft]
Jon,

I too found some problems with the German analyser recently. Here's what may help:

1. You can try reading Joerg Caumanns' paper "A Fast and Simple Stemming Algorithm for German Words". This paper describes the algorithm implemented by GermanAnalyzer.

2. I guess it's because German nouns are all capitalized, so maybe that's why. Although you would want to be indexing well-written German and not emails or text messages!

3. The German stemmer converts umlauts into some funny form (the code is a bit tricky, and I didn't spend any time looking at it), so maybe that's why you can't find umlauts properly. I think the main reason for this umlaut change is that many plurals are formed by umlauting: e.g. Haus, Haeuser (that ae is an umlauted a).

Finally, to really understand what's happening, get your hands on Luke. I just got it last week, and it's brilliant. It shows you everything about your indexes. You can also feed text to an analyser and see what it makes of it. This will show you the real reason why your umlaut search is failing.

Ciao,
Jonathan O'Connor
XCOM Dublin
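For anyone who wants to try that last suggestion without Luke: a minimal sketch of feeding text through the GermanAnalyzer and printing the resulting tokens, using the Lucene 1.4-era TokenStream API (the field name and sample words here are arbitrary):

    import java.io.StringReader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.de.GermanAnalyzer;

    public class ShowTokens {
        public static void main(String[] args) throws Exception {
            GermanAnalyzer analyzer = new GermanAnalyzer();
            // Each token printed here is exactly what gets indexed, so the
            // umlaut substitution and stemming become directly visible.
            TokenStream stream = analyzer.tokenStream("contents",
                    new StringReader("Haus Häuser häuslich"));
            Token token;
            while ((token = stream.next()) != null) {
                System.out.println(token.termText());
            }
        }
    }

Running this against the terms you are searching for also shows why a wildcard like ä* can fail: wildcard queries are not run through the analyzer, so they never undergo the same substitutions as the indexed terms.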
Re: Questions about GermanAnalyzer/Stemmer [auf Viren geprueft]
I had to moderate both Jonathan and Jon's messages in to the list. Please subscribe to the list and post to it with the address you've subscribed. I cannot always guarantee I'll catch moderation messages and send them through in a timely fashion.

Erik
Re: Custom filters document numbers
I'm also interested in knowing what can change the doc numbers. Does this happen frequently? Like Stanislav has been asking: what sort of operations on the index cause the document number to change for any given document? If the document numbers change frequently, is there a straightforward way to modify Lucene to keep the document numbers the same for the life of the document? I'd like to have mappings in my SQL database that point to the document numbers that Lucene search returns in its Hits objects.

Thanks,
-Tom-

--- Stanislav Jordanov wrote:
> The first statement is clear to me: I know that an IndexReader sees a 'snapshot' of the document set that was taken at the moment of the reader's creation. What I don't know is whether this 'snapshot' also has its doc numbers fixed, or whether they may change asynchronously. Another thing I don't know is which index operations may cause the (doc -> doc number) mapping to change. Is it only after a delete, are there other occasions, or had I better not count on this at all?
> StJ

----- Original Message ----- From: Vanlerberghe, Luc. Sent: Thursday, February 24, 2005 4:07 PM. Subject: RE: Custom filters document numbers

An IndexReader will always see the same set of documents. Even if another process deletes some documents, adds new ones or optimizes the complete index, your IndexReader instance will not see those changes. If you detect that the Lucene index changed (e.g. by calling IndexReader.getCurrentVersion(...) once in a while), you should close and reopen your 'current' IndexReader and recalculate any data that relies on the Lucene document numbers.

Regards, Luc.

-----Original Message----- From: Stanislav Jordanov. Sent: Thursday, 24 February 2005 14:18. Subject: Custom filters document numbers

Given an IndexReader, a custom filter is supposed to create a bit set that maps each document number to {'visible', 'invisible'}. On the other hand, it is stated that Lucene is allowed to change document numbers. Is it guaranteed that this BitSet's view of document numbers won't change while the BitSet is still in use (or while the corresponding IndexReader is still open)? And another (more low-level) question: when may Lucene change document numbers? Is it only when the index is optimized after there has been a delete operation?

Regards: StJ
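To make Luc's suggestion concrete, here is a small sketch (class and path names invented) of the polling pattern against the Lucene 1.4 API: keep one searcher per index version, and rebuild anything keyed on document numbers whenever the version moves.

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class SearcherHolder {
        private final Directory dir;
        private long version;
        private IndexSearcher searcher;

        public SearcherHolder(String path) throws Exception {
            dir = FSDirectory.getDirectory(path, false);
            version = IndexReader.getCurrentVersion(dir);
            searcher = new IndexSearcher(IndexReader.open(dir));
        }

        // Call before trusting any cached doc numbers or filter BitSets.
        public synchronized IndexSearcher getSearcher() throws Exception {
            long current = IndexReader.getCurrentVersion(dir);
            if (current != version) {
                searcher.close(); // the old snapshot's doc numbers are now stale
                searcher = new IndexSearcher(IndexReader.open(dir));
                version = current;
                // rebuild any BitSet computed against the old reader here
            }
            return searcher;
        }
    }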
Re[2]: Is IndexSearcher thread safe?
Hello, Volodymyr.

VB> Additional question.
VB> If I'm sharing one instance of IndexSearcher between different threads,
VB> is it OK to just drop this instance to the GC?
VB> Because I don't know if some thread is still using this searcher or done with it.

It is safe to share one instance between many threads, and it should be safe to drop the old object to the GC. But I have discovered one strange fact: when you have an IndexSearcher on a big index, so the IndexSearcher object takes a lot of memory (900MB), and you create a new IndexSearcher after deleting all references to the old IndexSearcher, the memory consumed by the old IndexSearcher will never be freed. What can the community say about this strange fact?

Yura Smolsky.
Re: Questions about GermanAnalyzer/Stemmer [auf Viren geprueft]
Apologies Erik,

This must be one of those apostrophe-in-email-address problems I always get. Recently I removed the apostrophe from the email address I give out. Our server recognizes both email addresses, but some of these mailing lists don't like the O'Connor clann!

Ciao,
Jonathan O'Connor
XCOM Dublin
RE: help with boolean expression
I found something kind of weird about the way Lucene interprets boolean expressions without parentheses. When I run the query A AND B OR C, it returns only the documents that have A (in other words, as if the query was just the term A). When I run the query A OR B AND C, it returns only the documents that have B AND C (as if the query was just B AND C). I set the default operator in my application to be AND. Can anyone explain this behavior? Thanks.

-----Original Message----- From: Morus Walter. Sent: Monday, February 28, 2005 2:40 AM. Subject: Re: help with boolean expression

Omar Didi writes:
> I have a problem understanding how Lucene would interpret this boolean expression: A AND B OR C. It neither returns the same count as when I enter (A AND B) OR C nor A AND (B OR C). If anyone knows how it is interpreted I would be thankful.

A AND B OR C creates a query that requires A and B. C influences the score, but is neither sufficient nor required for a match. IMO query parser is broken for queries mixing AND and OR without explicit braces. My favorite sample is `a AND b OR c AND d', which equals `a AND b AND c AND d' in query parser. I suggested a patch some time ago, but it's still pending in Bugzilla: http://issues.apache.org/bugzilla/show_bug.cgi?id=25820 Don't know if it's still usable with current sources.

Morus
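One way to sidestep the parser's precedence problem entirely is to build the boolean structure by hand. Here is a sketch (field and terms invented) of (A AND B) OR C using the Lucene 1.4 BooleanQuery API, where each clause carries explicit required/prohibited flags:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    public class ExplicitBoolean {
        public static void main(String[] args) {
            // Inner clause: A AND B (both required).
            BooleanQuery ab = new BooleanQuery();
            ab.add(new TermQuery(new Term("contents", "a")), true, false);
            ab.add(new TermQuery(new Term("contents", "b")), true, false);

            // Outer clause: (A AND B) OR C - two optional sub-clauses,
            // at least one of which must match.
            BooleanQuery query = new BooleanQuery();
            query.add(ab, false, false);
            query.add(new TermQuery(new Term("contents", "c")), false, false);
            System.out.println(query);
        }
    }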
Remove document fails
Hi,

I have a problem doing IndexReader.delete(int doc): it fails with a lock error.

Alex Kiselevski
+9.729.776.4346 (desk) +9.729.776.1504 (fax)
AMDOCS INTEGRATED CUSTOMER MANAGEMENT
RE: Is IndexSearcher thread safe?
> Additional question.
> If I'm sharing one instance of IndexSearcher between different threads,
> is it OK to just drop this instance to the GC?
> Because I don't know if some thread is still using this searcher or done with it.

Note that as long as one of the threads keeps a reference to the IndexSearcher it cannot be garbage collected. Perhaps you meant that you do not know how a thread can declare that it no longer needs the IndexSearcher.

To cope with this I created an IndexSearcher pool. The pool contains a list of IndexSearchers, and each one is associated with a counter. To get an IndexSearcher reference one must request it from the pool, and then the counter is incremented. (To make it cleaner I had the idea to replace the IndexSearcher references in the pool with proxy objects; that way the pool never hands raw IndexSearcher references to client objects, and the counter can be managed inside the proxy.) The pool has the ability to close and dereference an IndexSearcher when it is no longer used (counter == 0).

Hope it helps.
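A bare-bones sketch of the counter idea described above (all names invented; error handling omitted). The pool entry closes its searcher only once it has been retired and the last user has released it:

    import org.apache.lucene.search.IndexSearcher;

    class PooledSearcher {
        private final IndexSearcher searcher;
        private int refCount = 0;
        private boolean retired = false; // set once a newer searcher replaces this one

        PooledSearcher(IndexSearcher searcher) { this.searcher = searcher; }

        synchronized IndexSearcher acquire() {
            refCount++;
            return searcher;
        }

        synchronized void release() throws java.io.IOException {
            if (--refCount == 0 && retired) {
                searcher.close(); // safe: no thread holds it any more
            }
        }

        synchronized void retire() throws java.io.IOException {
            retired = true;
            if (refCount == 0) {
                searcher.close();
            }
        }
    }

Every search then becomes acquire() / search / release() in a try-finally block, and the proxy idea mentioned above would hide those calls from client code.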
RE: Re[2]: Is IndexSearcher thread safe?
I probably had the same trouble (but I'm not sure). I ran a test program that was creating a lot of IndexSearchers (but also closing and freeing them). It ended in an OutOfMemory exception. But I'm not finished with that problem (I need to use a profiler).

> But I have discovered one strange fact: when you have an IndexSearcher on a big index, so the IndexSearcher object takes a lot of memory (900MB), and you create a new IndexSearcher after deleting all references to the old IndexSearcher, the memory consumed by the old IndexSearcher will never be freed. What can the community say about this strange fact?
> Yura Smolsky.
Re: Remove document fails
Maybe you have an IndexWriter open at the same time you are trying to delete the document.

Alex Kiselevski wrote:
> Hi, I have a problem doing IndexReader.delete(int doc): it fails with a lock error.
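For reference, the usual shape of a delete in the Lucene 1.4 API (paths and field names invented): close any open IndexWriter first, since the writer and a deleting IndexReader both need the index's write lock.

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;

    public class DeleteDoc {
        public static void main(String[] args) throws Exception {
            // writer.close();  // must happen before the reader deletes

            IndexReader reader = IndexReader.open("/path/to/index");
            reader.delete(new Term("id", "42")); // or reader.delete(docNum)
            reader.close();                      // flushes deletes, releases the lock
        }
    }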
Zip Files
Hello;

Anyone have any ideas on how to index the contents within zip files?

Thanks,
Luke
Re: Zip Files
Hello

First, you need a parser for each file type: pdf, txt, word, etc., and use the Java API to iterate over the zip content. See:
http://java.sun.com/j2se/1.4.2/docs/api/java/util/zip/ZipInputStream.html
Use the getNextEntry() method. Little example:

    ZipInputStream zis = new ZipInputStream(fileInputStream);
    ZipEntry zipEntry;
    while ((zipEntry = zis.getNextEntry()) != null) {
        // use zipEntry to get the name, etc.
        // get the proper parser for the current entry
        // use the parser with zis (the ZipInputStream)
    }

Good luck,
Ernesto

Luke Shannon wrote:
> Anyone have any ideas on how to index the contents within zip files?

--
Ernesto De Santis - Colaborativa.net
Córdoba 1147 Piso 6 Oficinas 3 y 4 (S2000AWO) Rosario, SF, Argentina.
Large Index managing
Hi,

Just an idea on how to manage a large index that is updated very often.

Very often there is a need to update a document in the index. To update a document you must delete the old document from the index and then add the new one. In most cases this requires you to open an IndexReader, delete the document, close the IndexReader, create an IndexWriter, add the document, close the IndexWriter, and re-open the IndexSearcher (if the index is searched heavily). Profiling some applications, I found that most time is spent in the IndexReader.open() method. It also produces many objects, so it adds GC overhead.

The idea for optimizing this process is to create two indexes: one main index that can be very large, and a second index that serves as a change buffer. We can keep one IndexReader open for the first index (and use it for searching and for deleting old documents). The second index is small, and we can reopen its IndexReader frequently when needed. When the second index reaches some number of documents we can merge it with the main index.

To search this multi-index we can use a MultiSearcher over the two indexes, but with a little trick: the first IndexSearcher is kept the same the whole time until the second index is merged with the main one, and the second IndexSearcher is reopened whenever the second index changes.

It is just an idea (it is not tested). Will it help to improve the speed of updating a large index and lower the memory overhead? Any comments?

Regards,
Volodymyr Bychkoviak
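A rough sketch of the two-index arrangement (paths invented, reopening plumbing omitted): the buffer is searched alongside the main index through a MultiSearcher and folded in with addIndexes() once it grows large enough.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MultiSearcher;
    import org.apache.lucene.search.Searchable;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class BufferedIndex {
        public static void main(String[] args) throws Exception {
            Directory main = FSDirectory.getDirectory("/path/main", false);
            Directory buffer = FSDirectory.getDirectory("/path/buffer", false);

            // Long-lived searcher on main, frequently reopened one on buffer.
            MultiSearcher searcher = new MultiSearcher(new Searchable[] {
                    new IndexSearcher(main), new IndexSearcher(buffer) });

            // When the buffer is big enough: merge it into main, then
            // recreate the buffer empty and reopen both searchers.
            IndexWriter writer = new IndexWriter(main, new StandardAnalyzer(), false);
            writer.addIndexes(new Directory[] { buffer });
            writer.close();
        }
    }

One caveat, discussed further down in this digest: addIndexes() optimizes the target index, so each merge rewrites the main index in full.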
Re: Zip Files
Thanks Ernesto. The issue I'm working with now (this is more lack of experience than anything) is getting an input I can index. All my indexing classes (doc, pdf, xml, ppt) take a File object as a parameter and return a Lucene Document containing all the fields I need. I'm struggling with how I can work with an array of bytes instead of a Java File. It would be easier to unzip the zip to a temp directory, parse the files and then delete the directory. But this would greatly slow indexing and use up disk space.

Luke

----- Original Message ----- From: Ernesto De Santis. Sent: Tuesday, March 01, 2005 10:48 AM. Subject: Re: Zip Files
> First, you need a parser for each file type: pdf, txt, word, etc., and use the java.util.zip API to iterate over the zip content.
Re: Fast access to a random page of the search results.
Stanislav Jordanov wrote:

> startTs = System.currentTimeMillis();
> dummyMethod(hits.doc(nHits - nHits));
> stopTs = System.currentTimeMillis();
> System.out.println("Last doc accessed in " + (stopTs - startTs) + " ms");

'nHits - nHits' always equals zero. So you're actually printing the first document, not the last. The last document would be accessed with 'hits.doc(nHits - 1)'. Accessing the last document should not be much slower (or faster) than accessing the first. 200+ milliseconds to access a document does seem slow. Where is your index stored? On a local hard drive?

Doug
Re: Zip Files
Luke,

Look at the javadocs for java.io.ByteArrayInputStream - it wraps a byte array and makes it accessible as an InputStream. Also see java.util.zip.ZipFile. You should be able to read and parse all the contents of the zip file in memory.
http://java.sun.com/j2se/1.4.2/docs/api/java/io/ByteArrayInputStream.html

On Tue, 1 Mar 2005 12:39:17 -0500, Luke Shannon wrote:
> I'm struggling with how I can work with an array of bytes instead of a Java File. It would be easier to unzip the zip to a temp directory, parse the files and then delete the directory. But this would greatly slow indexing and use up disk space.
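A sketch of the in-memory route (no temp files): copy each zip entry into a byte array, then wrap it so a parser can read it as a stream. The file name and buffer size are arbitrary.

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.FileInputStream;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;

    public class ZipToStreams {
        public static void main(String[] args) throws Exception {
            ZipInputStream zis = new ZipInputStream(new FileInputStream("docs.zip"));
            ZipEntry entry;
            byte[] buf = new byte[4096];
            while ((entry = zis.getNextEntry()) != null) {
                // Copy the current entry fully into memory...
                ByteArrayOutputStream baos = new ByteArrayOutputStream();
                int n;
                while ((n = zis.read(buf)) != -1) {
                    baos.write(buf, 0, n);
                }
                // ...then hand a stream to whichever parser matches the
                // entry's extension; a File-based parser would need a
                // variant that accepts an InputStream instead.
                ByteArrayInputStream in = new ByteArrayInputStream(baos.toByteArray());
                System.out.println(entry.getName() + ": " + baos.size() + " bytes");
            }
            zis.close();
        }
    }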
Investigating Lucene For Project
I am looking for a solution to a problem I am having. We have a web-based asset management solution where we manage customers' assets. We have had requests from some clients who would like the ability to index PDF files now, and possibly other text files in the future. The PDF files live on a server in a structured environment. I would like to somehow index the content inside the PDFs and be able to run searches on that information from a web form. The result MUST BE a text snippet (that being some text prior to the searched word and after the searched word). Does this make sense? And can Lucene do this?

If the product can do this, what is the best way to get rolling on a project of this nature? Purchase an example book, or are there simple examples one can pick up on? Does Lucene have a large learning curve, or is it reasonably quick to pick up?

If all the above will work, what kind of license does this require? I have not been able to find a link to that yet on the Jakarta site.

I sincerely appreciate any input into this.

Sincerely,
Scott
Re: Investigating Lucene For Project
See inlined comments below.

> We have had requests from some clients who would like the ability to index PDF files ... The result MUST BE a text snippet ... Does this make sense? And can Lucene do this?

Lucene indexes text documents, so you will need to convert your PDFs to text. PDFBox (http://www.pdfbox.org/) can do that. PDFBox provides a summary of the document, which is just the first x number of characters; if you want a smarter summary you will need to create it yourself.

> If the product can do this, what is the best way to get rolling on a project of this nature?

There are tutorials available on the website, and I would recommend the Lucene in Action book. There is a learning curve for Lucene, but it sounds like your requirements are pretty basic, so it shouldn't be that hard.

> If all the above will work, what kind of license does this require?

http://www.apache.org/licenses/LICENSE-2.0

Ben
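A sketch of the PDF-to-Document step (package names follow the PDFBox releases of that era, org.pdfbox.*; adjust to your version; the field names are invented). Storing the extracted text is what makes the required snippets possible, e.g. with the Highlighter from the Lucene sandbox:

    import java.io.File;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.pdfbox.pdmodel.PDDocument;
    import org.pdfbox.util.PDFTextStripper;

    public class PdfToLuceneDoc {
        public static Document toDocument(File pdf) throws Exception {
            PDDocument pd = PDDocument.load(pdf.getAbsolutePath());
            try {
                String text = new PDFTextStripper().getText(pd);
                Document doc = new Document();
                doc.add(Field.Keyword("path", pdf.getPath()));
                // Field.Text stores and indexes the text, so the snippet
                // around a match can be cut from the stored value later.
                doc.add(Field.Text("contents", text));
                return doc;
            } finally {
                pd.close();
            }
        }
    }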
Best Practices for Distributing Lucene Indexing and Searching
Lucene Users,

We have a requirement for a new version of our software that it run in a clustered environment. Any node should be able to go down but the application must keep functioning. Currently, we use Lucene on a single node, but this won't meet our failover requirements. If we can't find a solution, we'll have to stop using Lucene and switch to something else, like full-text indexing inside the database.

So I'm looking for best practices on distributing Lucene indexing and searching. I'd like to hear from those of you using Lucene in a multi-process environment what is working for you. I've done some research, and based on what I've seen so far, here's a bit of brainstorming on what seems to be possible:

1. Don't. Have a single indexing and searching node. [Note: this is the last resort.]

2. Don't distribute indexing. Searching is distributed by storing the index on NFS; a single indexing node processes all requests. However, using Lucene on NFS is *not* recommended. See: http://lucenebook.com/search?query=nfs ... it can result in the stale NFS file handle problem: http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg12481.html So we'd have to investigate this option. Indexing could use a JMS queue so that if the box goes down, when it comes back up, indexing could resume where it left off.

3. Distribute indexing and searching into separate indexes for each node. Combine results using ParallelMultiSearcher. If a box went down, a piece of the index would be unavailable. Also, there would be serious issues making sure assets are indexed in the right place to prevent duplicates, stale results, or deleted assets from showing up in the index. Another possibility would be a hashing scheme for indexing: assets could be put into buckets based on their IDs to prevent duplication. Keeping results consistent while the number of buckets changes as nodes come up and down would be a challenge, though.

4. Distribute indexing and searching, but index everything at each node. Each node would have a complete copy of the index. Indexing would be slower. We could move to a 5- or 15-minute batch approach.

5. Index centrally and push updated indexes to search nodes on a periodic basis. This would be easy and might avoid the problems with using NFS.

6. Index locally and synchronize changes periodically. This is an interesting idea and bears looking into. Lucene can combine multiple indexes into a single one, which can be written out somewhere else and then distributed back to the search nodes to replace their existing index.

7. Create a JDBCDirectory implementation and let the database handle the clustering. A JDBCDirectory exists (http://ppinew.mnis.com/jdbcdirectory/), but it has only been tested with MySQL. It would probably require modification (the code is under the LGPL). At one time an OracleDirectory implementation existed, but that was in 2000, so it is surely badly outdated. In principle, though, the concept is possible. However, these database-based directories are slower at indexing and searching than the traditional style, probably mostly due to BLOB handling.

8. Can the Berkeley DB-based DbDirectory help us? I am not sure what advantages it would bring over the traditional FSDirectory, but maybe someone else has some ideas.

Please let me know if you've got any other ideas or a best practice to follow.

Thanks,
Luke Francl
Re: Fast access to a random page of the search results.
Daniel Naber wrote:
> After fixing this I can reproduce the problem with a local index that contains about 220,000 documents (700MB). Fetching the first document takes for example 30ms, fetching the last one takes 100ms. Of course I tested this with a query that returns many results (about 50,000). Actually it happens even with the default sorting, no need to sort by some specific field.

In part this is due to the fact that Hits first searches for the top-scoring 100 documents. Then, if you ask for a hit after that, it must re-query. In part this is also due to the fact that maintaining a queue of the top 50k hits is more expensive than maintaining a queue of the top 100 hits, so the second query is slower. And in part this could be caused by other things, such as that the highest-ranking documents might tend to be cached and not require disk I/O. One could perform profiling to determine which is the largest factor.

Of these, only the first is really fixable: if you know you'll need hit 50k then you could tell this to Hits and have it perform only a single query. But the algorithmic cost of keeping the queue of the top 50k is the same as collecting all the hits and sorting them. So, in part, getting hits 49,990 through 50,000 is inherently slower than getting hits 0-10. We can minimize that, but not eliminate it.

Doug
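For those who know the required depth up front, the 1.4 API already allows the single-query approach Doug describes: Searcher.search(Query, Filter, int) returns a TopDocs of exactly that size, avoiding the Hits re-query. A sketch (index path, field and depth invented):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;

    public class DeepPage {
        public static void main(String[] args) throws Exception {
            IndexSearcher searcher = new IndexSearcher("/path/to/index");
            // Collect the top 50,000 in one pass instead of letting Hits
            // re-query as you read past its initial window.
            TopDocs top = searcher.search(
                    new TermQuery(new Term("contents", "lucene")), null, 50000);
            ScoreDoc[] docs = top.scoreDocs;
            // Print the last ten hits actually found.
            for (int i = Math.max(0, docs.length - 10); i < docs.length; i++) {
                System.out.println(docs[i].doc + " score=" + docs[i].score);
            }
            searcher.close();
        }
    }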
Multiple indexes
Hi

My site has two types of documents with different structures. I would like to create an index for each type of document. What is the best way to implement this? I have been trying to implement this but found that 90% of the code is the same.

The Lucene in Action book has a case study on jGuru; it just mentions them using multiple indexes. I would like to do something like that. Are there any resources on the Internet that I can learn from?

Thanks,
Ben
How to manipulate the lucene index table
Hi all,

I have a web-based application that we use to index text documents as well as images; the indexed fields are either Field.Unstored or Field.Keyword. Currently, we plan to modify some of the index field names. For example, where the index field name was DOCLOCALE, we plan to break it up into two fields: DOCUMENTTYPE and LOCALE. Since the index files that Lucene creates have become quite big (close to 1 GB), we are looking for a way to read the index entries and modify them via a standalone Java program. Does Lucene provide any APIs to read these index entries and update them? Is there an easy way to do it?

Thanks in advance,
Srimant
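There is no in-place API for renaming fields; the usual approach is to rewrite the index document by document. A sketch (paths, separator, and field layout all invented), with one big caveat: IndexReader.document(i) only returns *stored* fields, so anything indexed as Field.Unstored cannot be recovered this way and those documents must be re-indexed from the original sources.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;

    public class SplitField {
        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open("/path/old-index");
            IndexWriter writer =
                    new IndexWriter("/path/new-index", new StandardAnalyzer(), true);
            for (int i = 0; i < reader.maxDoc(); i++) {
                if (reader.isDeleted(i)) continue;
                Document old = reader.document(i); // stored fields only
                String locale = old.get("DOCLOCALE");
                Document doc = new Document();
                // Hypothetical split on '_': first part DOCUMENTTYPE, rest LOCALE.
                int sep = locale.indexOf('_');
                doc.add(Field.Keyword("DOCUMENTTYPE", locale.substring(0, sep)));
                doc.add(Field.Keyword("LOCALE", locale.substring(sep + 1)));
                // ...copy the remaining stored fields here...
                writer.addDocument(doc);
            }
            writer.optimize();
            writer.close();
            reader.close();
        }
    }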
Re: Multiple indexes
It's hard to answer such a general question with anything very precise, so sorry if this doesn't hit the mark. Come back with more details and we'll gladly assist though.

First, certainly do not copy/paste code. Use standard reuse practices: perhaps the same program can build the two different indexes if passed different parameters, or share code between two different programs as a JAR. What specifically are the issues you're encountering?

Erik
Re: Best Practices for Distributing Lucene Indexing and Searching
> 6. Index locally and synchronize changes periodically. This is an interesting idea and bears looking into. Lucene can combine multiple indexes into a single one, which can be written out somewhere else and then distributed back to the search nodes to replace their existing index.

This is a promising idea for handling a high update volume because it avoids all of the search nodes having to do the analysis phase. Unfortunately, the way addIndexes() is implemented looks like it's going to present some new problems:

    public synchronized void addIndexes(Directory[] dirs) throws IOException {
        optimize();                                // start with zero or 1 seg
        for (int i = 0; i < dirs.length; i++) {
            SegmentInfos sis = new SegmentInfos(); // read infos from dir
            sis.read(dirs[i]);
            for (int j = 0; j < sis.size(); j++) {
                segmentInfos.addElement(sis.info(j)); // add each info
            }
        }
        optimize();                                // final cleanup
    }

We need to deal with some very large indexes (40GB+), and an optimize rewrites the entire index, no matter how few documents were added. Since our strategy calls for deleting some docs on the primary index before calling addIndexes(), this means *both* calls to optimize() will end up rewriting the entire index! The ideal behavior would be that of addDocument(): segments are only merged occasionally.

That said, I'll throw out a replacement implementation that probably doesn't work, but hopefully will spur someone with more knowledge of Lucene internals to take a look at this:

    public synchronized void addIndexes(Directory[] dirs) throws IOException {
        // REMOVED: optimize();
        for (int i = 0; i < dirs.length; i++) {
            SegmentInfos sis = new SegmentInfos(); // read infos from dir
            sis.read(dirs[i]);
            for (int j = 0; j < sis.size(); j++) {
                segmentInfos.addElement(sis.info(j)); // add each info
            }
        }
        maybeMergeSegments(); // replaces optimize
    }

-Yonik
Re: Multiple indexes
Is it true that for each index I have to create a separate instance of FSDirectory, IndexWriter and IndexReader? Do I need to create a separate locking mechanism as well? I have already implemented a program using just one index.

Thanks,
Ben
Re: Best Practices for Distributing Lucene Indexing and Searching
Yonik Seeley wrote:
> This is a promising idea for handling a high update volume because it avoids all of the search nodes having to do the analysis phase.

A clever way to do this is to take advantage of Lucene's index file structure. Indexes are directories of files. As the index changes through additions and deletions, most files in the index stay the same. So you can efficiently synchronize multiple copies of an index by only copying the files that change.

The way I did this for Technorati was to:

1. On the index master, periodically checkpoint the index. Every minute or so the IndexWriter is closed and a 'cp -lr index index.DATE' command is executed from Java, where DATE is the current date and time. This efficiently makes a copy of the index when it's in a consistent state by constructing a tree of hard links. If Lucene re-writes any files (e.g., the segments file), a new inode is created and the copy is unchanged.

2. From a crontab on each search slave, periodically poll for new checkpoints. When a new index.DATE is found, use 'cp -lr index index.DATE' to prepare a copy, then use 'rsync -W --delete master:index.DATE index.DATE' to get the incremental index changes. Then atomically install the updated index with a symbolic link ('ln -fsn index.DATE index').

3. In Java on the slave, re-open 'index' when its version changes. This is best done in a separate thread that periodically checks the index version. When it changes, the new version is opened and a few typical queries are performed on it to pre-load Lucene's caches. Then, in a synchronized block, the Searcher variable used in production is updated (see the sketch after this list).

4. In a crontab on the master, periodically remove the oldest checkpoint indexes.

Technorati's Lucene index is updated this way every minute. A mergeFactor of 2 is used on the master in order to minimize the number of segments in production. The master has a hot spare.

Doug
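A sketch of the swap in step 3 (names invented; a production version would pair this with reference counting like the pool described earlier in this digest, since in-flight searches may still hold the old searcher when it is closed):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;

    class SearcherSwapper {
        private IndexSearcher current;

        synchronized IndexSearcher current() { return current; }

        void swapIn(String newIndexPath) throws Exception {
            IndexSearcher fresh = new IndexSearcher(newIndexPath);
            // Warm the new searcher before exposing it to traffic.
            fresh.search(new TermQuery(new Term("contents", "lucene")));
            IndexSearcher old;
            synchronized (this) {
                old = current;   // swap under the lock, as in step 3
                current = fresh;
            }
            if (old != null) old.close(); // unsafe if queries still run on it
        }
    }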
Re: Multiple indexes
Ben,

You do need to use a separate instance of those 3 classes for each index, yes. But this is really something like:

    IndexWriter writer = new IndexWriter("/path/to/indexA", new StandardAnalyzer(), true);

So it's the normal code-writing process; you don't really have to create anything new, just use the existing Lucene API. As for locking, again you don't need to create anything. Lucene does have a locking mechanism, but most of it should be completely invisible to you if you follow the concurrency rules.

I hope this helps.
Otis

--- Ben wrote:
> Is it true that for each index I have to create a separate instance of FSDirectory, IndexWriter and IndexReader? Do I need to create a separate locking mechanism as well?
list moving to lucene.apache.org
This list is about to be moved to java-user at lucene.apache.org. Please excuse the temporary inconvenience.

Cheers,
Roy T. Fielding, co-founder, The Apache Software Foundation
([EMAIL PROTECTED]) http://roy.gbiv.com/