Software License
Hi all, I know Lucene is a free project, but I believe its use is governed by the Apache Software License (ASL), so someone using Lucene should reference the project, use the 'powered by Lucene' logo, and so on. I suspect that a company is releasing a commercial search engine based on Lucene without mentioning Lucene at all. What kind of action can we take to protect open source projects like Lucene from this kind of malicious use? Thanks,
RE: Googlifying lucene querys
In the Lucene build that we've got (2/21), the question mark does not do a single-character replace. Does anyone know why? We're using the StandardAnalyzer and the default QueryParser.

-Original Message-
From: Peter Carlson [mailto:[EMAIL PROTECTED]]
Sent: Saturday, February 23, 2002 5:23 PM
To: Lucene Users List
Subject: Re: Googlifying lucene querys

Hi Jari,

Lucene is designed as an API with different components broken out so a developer can create the uniqueness required. One part of Lucene is the QueryParser. The QueryParser takes a search string, creates a set of classes based on the current QueryParser.jj implementation, and turns it into a Lucene Query. This is meant to be a good solution for most people, but it is just a sample of what can be done.

In the current implementation of QueryParser,

george bush "white house"

will create an OR query of

george OR bush OR "white house"

Basically, the default is an OR between words unless otherwise specified. You can use other boolean operators like AND and NOT, so

george AND bush OR "white house" NOT ford

Lucene and the current QueryParser support
wildcards with the * character
single-character replace with the ? character
fuzzy searches with the ~ character when next to a single-word term
proximity searches (just added to QueryParser) with ~3 next to a phrase term

Again, you can create your own QueryParser to create your desired implementation. I hope this helps.

--Peter

On 2/23/02 8:19 AM, Jari Aarniala [EMAIL PROTECTED] wrote:

+george +bush +white +house

Well, that's pretty obvious even for me :) If you have separate words, just tokenize the string and add a plus in front of each of the words. But what I'm trying to do here is this: let's say I have a more complicated query, say

george bush "white house"

There you have two separate words, george and bush, and then white house enclosed in quotes. If I use a piece of simple tokenization code, the above query becomes

+george +bush +white +house

See what I mean? That won't work the way expected. Anyway, I'm still a bit confused about the inner workings of Lucene, so maybe I'll come up with something myself.

Jari Aarniala [EMAIL PROTECTED]

-- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
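A minimal sketch of the tokenization Jari is after, assuming the user marks phrases with double quotes: split on whitespace only outside quotes, then put a plus in front of each word or phrase. The names RequireAllTerms, splitTerms and requireAll are made up for illustration and are not part of Lucene, and the sketch does not try to handle input that already contains operators such as AND, NOT, + or -.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative helper (not a Lucene class): makes every word or quoted phrase required. */
class RequireAllTerms {

    /** Splits on whitespace, but keeps text inside double quotes together as one term. */
    static List<String> splitTerms(String input) {
        List<String> terms = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (char c : input.toCharArray()) {
            if (c == '"') {
                // Toggle phrase mode and keep the quote character in the term.
                inQuotes = !inQuotes;
                current.append(c);
            } else if (Character.isWhitespace(c) && !inQuotes) {
                // Whitespace outside quotes ends the current term.
                if (current.length() > 0) {
                    terms.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        // An unbalanced quote simply leaves the rest of the input as one term.
        if (current.length() > 0) {
            terms.add(current.toString());
        }
        return terms;
    }

    /** Builds a query string in which every word or phrase carries a '+' prefix. */
    static String requireAll(String input) {
        StringBuilder query = new StringBuilder();
        for (String term : splitTerms(input)) {
            if (query.length() > 0) {
                query.append(' ');
            }
            query.append('+').append(term);
        }
        return query.toString();
    }

    public static void main(String[] args) {
        System.out.println(requireAll("george bush \"white house\""));
    }
}
```

For the input above this prints +george +bush +"white house", which could then be handed to the QueryParser in place of the raw user input.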
RE: Googlifying lucene querys
If you put the title in a separate field from the contents, and search both fields, matches in the title will usually be stronger, without explicit boosting. This is because the scores are normalized by the length of the field, and the title tends to be much shorter than the contents. So even without boosting, title matches usually come before contents matches.

Doug

-Original Message-
From: Spencer, Dave [mailto:[EMAIL PROTECTED]]
Sent: Monday, February 25, 2002 10:22 AM
To: Lucene Users List
Subject: RE: Googlifying lucene querys

I'm pretty sure Google gives priority to the words appearing in the title and URL. I believe sect. 4.2.5 says this here:

http://citeseer.nj.nec.com/cache/papers/cs/13017/http:zSzzSzwww-db.stanford.eduzSzpubzSzpaperszSzgoogle.pdf/brin98anatomy.pdf

from here:

http://citeseer.nj.nec.com/brin98anatomy.html

So you have to have Lucene store the title as a separate field. This is then what you'd have if, like me, you boost (the caret is boost) the title by 5 and the URL by 2:

+(title:george^5.0 url:george^2.0 contents:george)
+(title:bush^5.0 url:bush^2.0 contents:bush)
+(title:white^5.0 url:white^2.0 contents:white)
+(title:house^5.0 url:house^2.0 contents:house)

-Original Message-
From: Ian Lea [mailto:[EMAIL PROTECTED]]
Sent: Saturday, February 23, 2002 8:15 AM
To: Lucene Users List
Subject: Re: Googlifying lucene querys

+george +bush +white +house

-- Ian.

Jari Aarniala wrote:

Hello,

Despite the confusing subject ;) my question is simple. I'm just trying out Lucene for the first time and would like to know how one would go about implementing the search on the index with the same logic that Google uses. For example, if the user input is george bush white house, how do I easily construct a query that searches for ALL of the words above? If I have understood correctly, passing the search string above to the QueryParser creates a query that searches for ANY of the words above.

Thanks for any help,
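The boosted query in Dave's message can be generated rather than typed by hand. The following is only a sketch that builds the query string; BoostedQueryBuilder, defaultBoosts and build are hypothetical names, and the field names and boost values are simply the ones from the message. It assumes the words contain no characters that are special to the query syntax.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative helper (not a Lucene class): expands each word across several boosted fields. */
class BoostedQueryBuilder {

    /** Field boosts from the message: title by 5, URL by 2, contents unboosted. */
    static Map<String, Float> defaultBoosts() {
        Map<String, Float> boosts = new LinkedHashMap<>(); // keeps insertion order
        boosts.put("title", 5.0f);
        boosts.put("url", 2.0f);
        boosts.put("contents", 1.0f);
        return boosts;
    }

    /** Produces one required clause per word, each searching every field. */
    static String build(String[] words, Map<String, Float> fieldBoosts) {
        StringBuilder query = new StringBuilder();
        for (String word : words) {
            if (query.length() > 0) {
                query.append(' ');
            }
            query.append("+(");
            boolean first = true;
            for (Map.Entry<String, Float> entry : fieldBoosts.entrySet()) {
                if (!first) {
                    query.append(' ');
                }
                first = false;
                query.append(entry.getKey()).append(':').append(word);
                float boost = entry.getValue();
                // A boost of 1.0 is the neutral value, so it is left out.
                if (boost != 1.0f) {
                    query.append('^').append(boost);
                }
            }
            query.append(')');
        }
        return query.toString();
    }

    public static void main(String[] args) {
        String[] words = {"george", "bush", "white", "house"};
        System.out.println(build(words, defaultBoosts()));
    }
}
```

Keeping the boosts in a map makes it easy to experiment with different weights, or to drop the explicit title boost altogether and rely on the length normalization Doug describes.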
RE: Googlifying lucene querys
You cannot, in general, structure a Lucene query such that it will yield the same document rankings that Google would for that (query, document set). The reason for this is that Google employs a scoring algorithm that includes information about the topology of the pages (i.e., how the pages are linked together). (An overview of what Google does in this regard may be found at http://www.google.com/technology/index.html .) Thus, in order to get Lucene to do what Google does, you'd have to rewrite large chunks of it.

Joshua [EMAIL PROTECTED]
Per Obscurius...www.ics.uci.edu/~jmadden
Joshua Madden: Information Scientist, Musician, Philosopher-At-Tall
It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.
Build index using RAMDirectory out of memory errors
I have been using Lucene for 3 weeks and it rules. The indexing process can be slow, so I searched the mailing list archives and found example code using RAMDirectory to improve indexing speed. The example code I found was indexing 100,000 files at a time to the RAMDirectory before writing to disk. I tried indexing 10,000 files at a time to the RAMDirectory before writing to disk. This drastically improved indexing times, but sometimes I get out-of-memory errors. I am indexing text files and adding 9 fields from an Oracle database.

Environment:
Solaris 2.8 with 1G of RAM and 2G of swap
Java 1.3.1
Lucene 1.2-rc4

Any ideas for eliminating the out-of-memory errors?
RE: Googlifying lucene querys
From: Joshua O'Madadhain [mailto:[EMAIL PROTECTED]]

You cannot, in general, structure a Lucene query such that it will yield the same document rankings that Google would for that (query, document set). The reason for this is that Google employs a scoring algorithm that includes information about the topology of the pages (i.e., how the pages are linked together). (An overview of what Google does in this regard may be found at http://www.google.com/technology/index.html .) Thus, in order to get Lucene to do what Google does, you'd have to rewrite large chunks of it.

I don't agree with your conclusion: you would not have to re-write much of Lucene to incorporate this sort of information.

To my understanding, Google uses linking information as a factor in scoring. Thus every document in the index has a factor computed from its links that is multiplied into its score. Lucene already keeps a factor per document that is multiplied into its score, but one that is computed from the document's length, not its links. Thus, once one has computed link scores, to add them to Lucene we just need to permit applications to affect this factor, with something like a Document.setBoost(float) method.

The representation of the per-document factor would also need to change a little internally. It is currently stored as a single byte, and multiplying in an arbitrary factor would cause overflow. But enlarging it to 16 bits would be a small change.

So adding such a capability would require re-writing only a very small chunk of Lucene. Computing a link-based factor would also take some code, but that's writing, not re-writing.

Doug
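To make the arithmetic in Doug's proposal concrete, here is a toy model of a per-document factor being multiplied into a score. It is not Lucene's scoring code: the class DocumentBoostSketch, the 1/sqrt(length) form of the length factor, and the link boost values are all assumptions chosen for illustration.

```java
/** Toy model (not Lucene code) of a per-document factor multiplied into a score. */
class DocumentBoostSketch {

    /** Assumed length factor: shorter fields get a larger value. */
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    /** Score = raw match score * length factor * application-supplied boost. */
    static float score(float rawScore, int numTerms, float linkBoost) {
        return rawScore * lengthNorm(numTerms) * linkBoost;
    }

    public static void main(String[] args) {
        // Two documents that match equally well and have the same length.
        // The second is assumed to have a higher link-derived boost.
        float plain = score(1.0f, 100, 1.0f);
        float wellLinked = score(1.0f, 100, 3.0f);
        System.out.println("plain=" + plain + " wellLinked=" + wellLinked);
    }
}
```

Because the boost is just one more multiplier, a document with more incoming links would outrank an otherwise identical one, which is the effect Doug describes. How the link boost itself is computed is outside this sketch.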
Re: Build index using RAMDirectory out of memory errors
java -Xmx1000m

Sorry if you already tried resizing your heap. Actually with 1.3.1 you could go up above a gig, but really swapping aint gonna help much.

Winton

--
Winton Davies
Lead Engineer, Overture (NSDQ: OVER)
1820 Gateway Drive, Suite 360
San Mateo, CA 94404
work: (650) 403-2259 cell: (650) 867-1598
http://www.overture.com/
is there any way to create and manage a controlled vocabulary in lucene?
subj?
Re: Performance Tuning
You could try playing with the merge factor...

Otis

--- Aruna Raghavan [EMAIL PROTECTED] wrote:

Hi, Are there any ways to fine-tune the CPU performance with Lucene? I know of the usage of optimize() calls, but I am wondering if there are any other ways to improve the CPU time/disk space performance. Thanks!
Re: Build index using RAMDirectory out of memory errors
Have you tried different values for IndexWriter.mergeFactor? Setting it to 1000 gave me a 10x speed improvement on a large index some time ago. Not with RAMDirectory, though. Your mileage may vary.

-- Ian.

Kurt Vaag wrote:

I have been using Lucene for 3 weeks and it rules. The indexing process can be slow, so I searched the mailing list archives and found example code using RAMDirectory to improve indexing speed. The example code I found was indexing 100,000 files at a time to the RAMDirectory before writing to disk. I tried indexing 10,000 files at a time to the RAMDirectory before writing to disk. This drastically improved indexing times, but sometimes I get out-of-memory errors. I am indexing text files and adding 9 fields from an Oracle database.

Environment:
Solaris 2.8 with 1G of RAM and 2G of swap
Java 1.3.1
Lucene 1.2-rc4

Any ideas for eliminating the out-of-memory errors?
RE: Build index using RAMDirectory out of memory errors
Thanks Winton,

That's what it was. I just assumed Java would take all of the 1G that it needed. I didn't realize the default was 64M. Also thanks for not saying RTFM (which I had done, but I didn't know what TF to do with the -Xmx option).

-Kurt

-Original Message-
From: Winton Davies [mailto:[EMAIL PROTECTED]]
Sent: Monday, February 25, 2002 12:22 PM
To: Lucene Users List
Subject: Re: Build index using RAMDirectory out of memory errors

java -Xmx1000m

Sorry if you already tried resizing your heap. Actually with 1.3.1 you could go up above a gig, but really swapping aint gonna help much.

Winton
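One way to avoid guessing how much heap the JVM was actually given is to ask it, and to flush the in-memory batch when usage gets close to that limit instead of after a fixed number of files. The sketch below is only an illustration: HeapCheck, shouldFlush and the 0.8 threshold are made-up names and values, and Runtime.maxMemory() was added in Java 1.4, so it would not be available on the 1.3.1 JVM mentioned in this thread.

```java
/** Illustrative heap check (not part of Lucene); requires Java 1.4 or later. */
class HeapCheck {

    /** Maximum heap the JVM will try to use, in megabytes (controlled by -Xmx). */
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    /** Approximate number of heap bytes currently in use. */
    static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** True when usage has reached the given fraction of the maximum heap. */
    static boolean shouldFlush(long usedBytes, long maxBytes, double threshold) {
        return usedBytes >= (long) (maxBytes * threshold);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMb() + " MB");
        // In an indexing loop, this check would decide when to write the
        // in-memory batch to disk and start a fresh one.
        boolean flush = shouldFlush(usedBytes(), Runtime.getRuntime().maxMemory(), 0.8);
        System.out.println("Flush now? " + flush);
    }
}
```

Running it once with and once without java -Xmx1000m shows whether the option took effect.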
RE: Index Locked For Write
I am not a Lucene expert, but I would like to understand the threading issues too, and I'm wondering whether the following is true when using Lucene in a multithreaded application.

I understand there are three modes for using IndexReader and IndexWriter:

A- IndexReader for reading only, not deleting
B- IndexReader for deleting (and reading)
C- IndexWriter (for adding and optimizing)

Any number of readers may be used concurrently in mode A. But for B and C the reader or writer may not be kept open for long periods. Write operations create a lock, and closing the reader or writer is the only way to release the lock. In theory a single writer could be kept open, but its lock will prevent deletions (which are performed with a separate reader).

Therefore for B and C each set of changes should be made inside a synchronized block where the reader or writer is opened and closed. This prevents multiple writers (or readers used for deleting) from being open at once. The synchronization should be done on an object that identifies a particular index, e.g., on a global object if there is only one index. For example:

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;

    class MyIndex {
        // Single lock object guarding all changes to this index.
        static final Object INDEX_LOCK = new Object();

        void delete(int[] docs) throws IOException {
            synchronized (INDEX_LOCK) {
                // Open the reader only for the duration of this set of deletions.
                IndexReader reader = IndexReader.open(...);
                try {
                    for (int i = 0; i < docs.length; i++) {
                        reader.delete(docs[i]);
                    }
                } finally {
                    reader.close(); // closing releases the lock
                }
            }
        }

        void add(Document[] docs) throws IOException {
            synchronized (INDEX_LOCK) {
                IndexWriter writer = new IndexWriter(...);
                try {
                    for (int i = 0; i < docs.length; i++) {
                        writer.addDocument(docs[i]);
                    }
                    writer.optimize();
                } finally {
                    writer.close(); // closing releases the lock
                }
            }
        }
    }

Of course there are other techniques for global locking such as 'static synchronized' methods. Locking on a separate object per index is the general case (where multiple indexes are present).

Is this correct? Or should Lucene be waiting on the write lock instead of throwing an exception?
mark

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
Sent: Sunday, February 24, 2002 9:22 PM
To: Lucene Users List
Subject: RE: Index Locked For Write

--- Howk, Michael [EMAIL PROTECTED] wrote:

Out of curiosity, why didn't we need to close the writer in rc2 or rc3? When you suggest a synchronized keyword, are you suggesting that the writer is not inherently thread-safe? Do we need to write our own thread management on top of Lucene?

Sorry, that might have been a wrong suggestion, IndexWriter (at least the add method) is supposed to be thread safe.

Otis

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
Sent: Thursday, February 21, 2002 4:07 PM
To: Lucene Users List
Subject: RE: Index Locked For Write

You could use synchronized keyword and use IndexReader.isLocked() or something like that, no?

Otis

--- Howk, Michael [EMAIL PROTECTED] wrote:

Thank you for your quick responses. But in our application, we're working in a transactional environment where multiple threads are accessing a single writer using the recommended singleton pattern. Since no thread has exclusive access to the writer, how can we have one thread arbitrarily decide to close the writer?

Michael

-Original Message-
From: Mark Tucker [mailto:[EMAIL PROTECTED]]
Sent: Thursday, February 21, 2002 3:51 PM
To: Lucene Users List
Subject: RE: Index Locked For Write

You forgot to close your writer after the call to optimize.

-Original Message-
From: Howk, Michael [mailto:[EMAIL PROTECTED]]
Sent: Thursday, February 21, 2002 2:49 PM
To: Lucene Mailing List (E-mail)
Subject: Index Locked For Write

We just got the newest daily build (to try to fix some NullPointer errors with ? and _ characters), and we're getting the same problem that Daniel Calvo mentioned: Index Locked for Write.
Here's basically what our code is doing:

    IndexWriter writer = new IndexWriter(path, analyzer, create);
    try {
        Document doc = new Document();
        doc.add(Field.Keyword(DOC_ID, 14));
        doc.add(Field.UnStored(ANY, mushu));
        writer.addDocument(doc);
        writer.optimize();

        // Search the document for our keyword
        {
            IndexReader reader = IndexReader.open(path);
            IndexSearcher searcher = new IndexSearcher(reader);
            Vector returnStuff = searcher.search(mushu);
        }

        // Verify that we got one record back
        assertNotNull(returnStuff);
        assertEquals(1, returnStuff.size());
    } finally {
        // Clean up after ourselves
        IndexReader reader = IndexReader.open(path);
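The "separate lock object per index" idea from Mark's message at the top of this thread can be sketched without any Lucene classes. IndexLocks, lockFor and withIndexLock are hypothetical names; the sketch assumes each index is identified by a path string that callers always spell the same way, and it only coordinates threads inside a single JVM.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Illustrative registry (not a Lucene class) handing out one lock object per index path. */
class IndexLocks {

    private static final ConcurrentMap<String, Object> LOCKS = new ConcurrentHashMap<>();

    /** Returns the same lock object every time for a given index path. */
    static Object lockFor(String indexPath) {
        return LOCKS.computeIfAbsent(indexPath, path -> new Object());
    }

    /**
     * Runs one set of changes while holding the lock for that index.
     * The caller's Runnable is expected to open and close its own
     * reader or writer inside this block.
     */
    static void withIndexLock(String indexPath, Runnable change) {
        synchronized (lockFor(indexPath)) {
            change.run();
        }
    }

    public static void main(String[] args) {
        withIndexLock("/tmp/index-a", () -> System.out.println("changing index a"));
        withIndexLock("/tmp/index-b", () -> System.out.println("changing index b"));
    }
}
```

Changes to different indexes can then proceed in parallel, while two threads touching the same index take turns instead of running into the write lock.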
Re: is there any way to create and manage a controlled vocabulary in lucene?
Hi,

Are you just trying to have Lucene index only terms that are in your vocabulary? If so, you can create your own analyzer that returns only words in your vocabulary. Alternatively, you could use the StandardAnalyzer and create your own Lucene Document, adding only words that match your vocabulary.

If you just want to see if it works, you might try just adding the code on top of your own document. There are many examples of Lucene Documents: the HTMLDocument in the demo, or just the text document.

Hope this helps

--Peter

On 2/25/02 11:29 AM, Philipp Chudinov [EMAIL PROTECTED] wrote:

subj?
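A rough sketch of the second approach Peter mentions, filtering the text down to vocabulary terms before it is added to a document. VocabularyFilter is a made-up class, not a Lucene analyzer, and the tokenization (lowercasing and splitting on anything that is not a letter or digit) is an assumption rather than what StandardAnalyzer does.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Illustrative filter (not a Lucene analyzer): keeps only terms from a fixed vocabulary. */
class VocabularyFilter {

    private final Set<String> vocabulary = new HashSet<>();

    VocabularyFilter(String... allowedTerms) {
        for (String term : allowedTerms) {
            vocabulary.add(term.toLowerCase(Locale.ROOT));
        }
    }

    /** Returns the vocabulary terms found in the text, in order of appearance. */
    List<String> filter(String text) {
        List<String> kept = new ArrayList<>();
        // Split on runs of characters that are neither letters nor digits.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && vocabulary.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        VocabularyFilter filter = new VocabularyFilter("lucene", "index");
        System.out.println(filter.filter("Lucene builds an Index, fast."));
    }
}
```

The surviving terms could be joined back into a string and added as the field to be indexed, so that only controlled-vocabulary terms end up in the index.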