Re: Permissioning Documents
On Friday 10 December 2004 07:10, Steve Skillcorn wrote: Hi; I'm currently using Lucene (which I am extremely impressed with, BTW) to index a knowledge base of documents. One issue I have is that only certain documents are available to certain users (or groups). The number of documents is large, into the 100,000s, and the number of users can be in the 1000s. Obviously, the users permissioned to see certain documents can change regularly, so storing the user IDs in the Lucene document is undesirable, as a permission change could mean a delete and re-add of potentially hundreds of documents. Does anyone have any guidance as to how I should approach this? A typical solution would be to use a Filter for each user group. Each Filter would be built from categories indexed with the documents. A good moment to build a group Filter is the first time a user from that group queries an index after it is opened. Filters can be cached; see the recent discussion on CachingWrapperFilter and friends. Regards, Paul Elschot
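A minimal sketch of the per-group Filter approach Paul describes, using the Lucene 1.4-era API; the "group" field name, index path, and query are illustrative assumptions rather than anything from the original post:
===
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.*;

public class GroupFilterSearch {
    // One filter per group; CachingWrapperFilter keeps the computed bit set
    // per index reader, so it is rebuilt only when the index is reopened.
    static Filter filterForGroup(String group) {
        Query groupQuery = new TermQuery(new Term("group", group));
        return new CachingWrapperFilter(new QueryFilter(groupQuery));
    }

    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("/path/to/index"); // illustrative path
        Query userQuery = QueryParser.parse("knowledge base", "contents", new StandardAnalyzer());
        Filter groupFilter = filterForGroup("projectManager");
        Hits hits = searcher.search(userQuery, groupFilter); // only docs visible to this group
        System.out.println(hits.length() + " matching documents this group may see");
        searcher.close();
    }
}
===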
Lucene in Action e-book now available!
The Lucene in Action e-book is now available at Manning's site: http://www.manning.com/hatcher2 Manning also put lots of other goodies there: the table of contents, about this book, the preface, the foreword from Doug Cutting himself (thanks Doug!!!), and a couple of sample chapters. The complete source code is there as well. Now comes the exciting part: finding out what others think of the work Otis and I spent 14+ months of our lives on. Erik
Re: Permissioning Documents
Hi Steve, Possibly the easiest way to handle this is to tag the documents with a field listing the permitted roles/groups (not the individual users). I would be tempted to keep the information that associates users to groups outside of the Lucene index, e.g. in a relational DB. This way you do not need to worry about updating the Lucene index every time a new user is added or is granted membership to a group. When you search, simply use a QueryFilter which lists the current user's roles, e.g. groups:(admin, projectManager) - this will restrict the search results to only those docs associated with the user's roles. Cheers Mark
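A sketch of Mark's suggestion under similar assumptions - a "groups" field added at index time, with the user-to-group mapping kept outside Lucene (e.g. in a relational DB); the field names, paths, and role values are illustrative:
===
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.*;

public class RoleTaggedSearch {
    public static void main(String[] args) throws Exception {
        // Index time: tag each document with the roles allowed to see it.
        IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
        Document doc = new Document();
        doc.add(Field.Text("contents", "quarterly project report"));
        doc.add(Field.Text("groups", "admin projectManager")); // analyzed into one term per role
        writer.addDocument(doc);
        writer.close();

        // Search time: look up the current user's roles in the external store
        // and express them as a filter query over the "groups" field.
        IndexSearcher searcher = new IndexSearcher("/path/to/index");
        Query roles = QueryParser.parse("groups:(admin projectManager)", "contents", new StandardAnalyzer());
        Filter roleFilter = new QueryFilter(roles);
        Query userQuery = QueryParser.parse("report", "contents", new StandardAnalyzer());
        Hits hits = searcher.search(userQuery, roleFilter);
        System.out.println(hits.length() + " docs visible to this user");
        searcher.close();
    }
}
===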
HITCOLLECTOR+SCORE+DILEMMA
Hi guys, Apologies. I am still in a dilemma on how to use the HitCollector for returning hits between scores 0.2f and 1.0f. There is no simple example for this, yet lots of talk about its usage on the forum. Please could somebody spare a bit of code (your intelligence) on this forum. Thanks in advance Karthik WITH WARM REGARDS HAVE A NICE DAY [ N.S.KARTHIK]
RE: Lucene in Action e-book now available!
Am I the first one who bought the Lucene in Action book? Thanks Erik and Otis. William W. Silva From: Erik Hatcher [EMAIL PROTECTED] Reply-To: Lucene Users List [EMAIL PROTECTED] To: Lucene User [EMAIL PROTECTED], Lucene List [EMAIL PROTECTED] Subject: Lucene in Action e-book now available! Date: Fri, 10 Dec 2004 03:52:55 -0500 The Lucene in Action e-book is now available at Manning's site: http://www.manning.com/hatcher2 Manning also put lots of other goodies there: the table of contents, about this book, the preface, the foreword from Doug Cutting himself (thanks Doug!!!), and a couple of sample chapters. The complete source code is there as well. Now comes the exciting part: finding out what others think of the work Otis and I spent 14+ months of our lives on. Erik
Re: HITCOLLECTOR+SCORE+DILEMMA
On Dec 10, 2004, at 7:39 AM, Karthik N S wrote: I am still in a dilemma on how to use the HitCollector for returning hits between scores 0.2f and 1.0f. There is no simple example for this, yet lots of talk about its usage on the forum. Unfortunately there isn't a clean way to stop a HitCollector - it will simply collect all hits. Also, scores are _not_ normalized when passed to a HitCollector, so you may get scores > 1.0. Hits, however, does normalize, and you're guaranteed that scores will be <= 1.0. Hits are in descending score order, so you may just want to use Hits and filter based on the score provided by hits.score(i). Erik
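A minimal sketch of what Erik suggests - iterate the normalized, descending-ordered Hits and stop once the score drops below the lower bound; the index path, query, and the stored "path" field are illustrative:
===
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class ScoreRangeExample {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("/path/to/index"); // illustrative
        Query query = QueryParser.parse("camera", "contents", new StandardAnalyzer());
        Hits hits = searcher.search(query);

        float minScore = 0.2f; // Hits scores are normalized to <= 1.0
        for (int i = 0; i < hits.length(); i++) {
            if (hits.score(i) < minScore) {
                break; // Hits are in descending score order, so we can stop here
            }
            Document doc = hits.doc(i);
            System.out.println(hits.score(i) + "\t" + doc.get("path")); // "path" assumed stored
        }
        searcher.close();
    }
}
===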
Re: SEARCH +HITS+LIMIT
On Dec 10, 2004, at 8:24 AM, Andraz Skoric wrote: Displaytag (http://displaytag.sourceforge.net/) is for displaying search results in multiple pages I don't know displaytag internals, but be cautious with such things. What you do not want to happen is all the results to be grabbed and cached somehow. You only want to retrieve the actual documents being shown on that specific page. It looks like displaytag can support this, as long as you provide your own custom pruned document set. Personally, I use Tapestry :) Erik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
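For the paging concern Erik raises, a sketch of the usual re-query approach - run the search for each page request and only fetch the documents actually displayed on that page; the page size, index path, and field names are illustrative:
===
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class PagedResults {
    public static void main(String[] args) throws Exception {
        int page = Integer.parseInt(args.length > 0 ? args[0] : "1"); // 1-based page number
        int pageSize = 25;

        IndexSearcher searcher = new IndexSearcher("/path/to/index"); // illustrative
        Query query = QueryParser.parse("camera", "contents", new StandardAnalyzer());
        Hits hits = searcher.search(query);

        int start = (page - 1) * pageSize;
        int end = Math.min(start + pageSize, hits.length());
        System.out.println("Total hits: " + hits.length() + ", showing " + (start + 1) + "-" + end);

        // Only the documents on this page are fetched here; Hits loads
        // documents lazily as doc(i) is called.
        for (int i = start; i < end; i++) {
            System.out.println(hits.doc(i).get("path")); // "path" assumed stored
        }
        searcher.close();
    }
}
===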
Re: Lucene in Action e-book now available!
Nice Work! Congratulations Guys. - Original Message - From: Erik Hatcher [EMAIL PROTECTED] To: Lucene User [EMAIL PROTECTED]; Lucene List [EMAIL PROTECTED] Sent: Friday, December 10, 2004 3:52 AM Subject: Lucene in Action e-book now available! The Lucene in Action e-book is now available at Manning's site: http://www.manning.com/hatcher2 Manning also put lots of other goodies there: the table of contents, about this book, the preface, the foreword from Doug Cutting himself (thanks Doug!!!), and a couple of sample chapters. The complete source code is there as well. Now comes the exciting part: finding out what others think of the work Otis and I spent 14+ months of our lives on. Erik
Re: Lucene in Action e-book now available!
Congrats! I went through sample chapter 1 - well written. On Fri, 10 Dec 2004 09:58:25 -0500, Luke Shannon [EMAIL PROTECTED] wrote: Nice Work! Congratulations Guys. - Original Message - From: Erik Hatcher [EMAIL PROTECTED] To: Lucene User [EMAIL PROTECTED]; Lucene List [EMAIL PROTECTED] Sent: Friday, December 10, 2004 3:52 AM Subject: Lucene in Action e-book now available! The Lucene in Action e-book is now available at Manning's site: http://www.manning.com/hatcher2 Manning also put lots of other goodies there: the table of contents, about this book, the preface, the foreword from Doug Cutting himself (thanks Doug!!!), and a couple of sample chapters. The complete source code is there as well. Now comes the exciting part: finding out what others think of the work Otis and I spent 14+ months of our lives on. Erik -- Regards, Robin 9886394650 The merit of an action lies in finishing it to the end
Re: OutOfMemoryError with Lucene 1.4 final
You probably need to increase the amount of RAM available to your JVM. See the parameters: -Xmx :Maximum memory usable by the JVM -Xms :Initial memory allocated to JVM My params are; -Xmx2048m -Xms128m (2G max, 128M initial) On Fri, 10 Dec 2004 11:17:29 -0600, Sildy Augustine [EMAIL PROTECTED] wrote: I think you should close your files in a finally clause in case of exceptions with file system and also print out the exception. You could be running out of file handles. -Original Message- From: Jin, Ying [mailto:[EMAIL PROTECTED] Sent: Friday, December 10, 2004 11:15 AM To: [EMAIL PROTECTED] Subject: OutOfMemoryError with Lucene 1.4 final Hi, Everyone, We're trying to index ~1500 archives but get OutOfMemoryError about halfway through the index process. I've tried to run program under two different Redhat Linux servers: One with 256M memory and 365M swap space. The other one with 512M memory and 1G swap space. However, both got OutOfMemoryError at the same place (at record 898). Here is my code for indexing: === Document doc = new Document(); doc.add(Field.UnIndexed(path, f.getPath())); doc.add(Field.Keyword(modified, DateField.timeToString(f.lastModified(; doc.add(Field.UnIndexed(eprintid, id)); doc.add(Field.Text(metadata, metadata)); FileInputStream is = new FileInputStream(f); // the text file BufferedReader reader = new BufferedReader(new InputStreamReader(is)); StringBuffer stringBuffer = new StringBuffer(); String line = ; try{ while((line = reader.readLine()) != null){ stringBuffer.append(line); } doc.add(Field.Text(contents, stringBuffer.toString())); // release the resources is.close(); reader.close(); }catch(java.io.IOException e){} = Is there anything wrong with my code or I need more memory? Thanks for any help! Ying - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: OutOfMemoryError with Lucene 1.4 final
I am not sure, but I can guess at three possibilities. (1) I see that you use Field.Text("contents", stringBuffer.toString()). This will store the whole string of text in the document object, and it might be long; I do not know the details of how Lucene implements this. You could try an unstored field first and see if the same problem happens. BTW, how large are your documents? My index has 1M docs with a maximum length of less than 1M, usually around several KB. (2) Another possibility is that record 898 is a very long document - maybe Java's String object has a maximum length? Just trace the code and see where the exception occurs. (3) Moreover, if you run it on a Java VM, the VM also has its own memory limit setting, which has nothing to do with the hardware you are running on. I have hit this before when using a directory's list-of-files function, where it easily exceeded the maximum memory because there were 1M docs under the same dir (a stupid mistake I made). But once I expanded the VM's memory, it was OK. :) On Fri, 10 Dec 2004, Jin, Ying wrote: Hi, Everyone, We're trying to index ~1500 archives but get OutOfMemoryError about halfway through the index process. I've tried to run the program under two different Redhat Linux servers: One with 256M memory and 365M swap space. The other one with 512M memory and 1G swap space. However, both got OutOfMemoryError at the same place (at record 898). Here is my code for indexing: === Document doc = new Document(); doc.add(Field.UnIndexed("path", f.getPath())); doc.add(Field.Keyword("modified", DateField.timeToString(f.lastModified()))); doc.add(Field.UnIndexed("eprintid", id)); doc.add(Field.Text("metadata", metadata)); FileInputStream is = new FileInputStream(f); // the text file BufferedReader reader = new BufferedReader(new InputStreamReader(is)); StringBuffer stringBuffer = new StringBuffer(); String line = ""; try { while ((line = reader.readLine()) != null) { stringBuffer.append(line); } doc.add(Field.Text("contents", stringBuffer.toString())); // release the resources is.close(); reader.close(); } catch (java.io.IOException e) {} === Is there anything wrong with my code or do I need more memory? Thanks for any help! Ying
sorting tokenized field
I read that tokenized fields cannot be sorted. In order to sort on a tokenized field, the application either has to duplicate the field with a different name and not tokenize it, or come up with something else. But shouldn't the search engine take care of this? Are there any plans to build this functionality into Lucene? Praveen ** Praveen Peddi Sr Software Engg, Context Media, Inc. email:[EMAIL PROTECTED] Tel: 401.854.3475 Fax: 401.861.3596 web: http://www.contextmedia.com ** Context Media - The Leader in Enterprise Content Integration
Re: sorting tokenized field
On Dec 10, 2004, at 1:40 PM, Praveen Peddi wrote: I read that tokenized fields cannot be sorted. In order to sort on a tokenized field, the application either has to duplicate the field with a different name and not tokenize it, or come up with something else. But shouldn't the search engine take care of this? Are there any plans to build this functionality into Lucene? It would be wasteful for Lucene to assume any field you add should be available for sorting. Adding one more line to your indexing code to accommodate your sorting needs seems a pretty small price to pay. Do you have suggestions to improve how this works? Or how it is documented? Erik
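The "one more line" Erik mentions, sketched with the 1.4-era API - index the value once tokenized for searching and once as an untokenized keyword for sorting; the field names and path are illustrative:
===
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.*;

public class SortableFieldExample {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
        Document doc = new Document();
        String title = "Lucene in Action";
        doc.add(Field.Text("title", title));        // tokenized: used for searching
        doc.add(Field.Keyword("titleSort", title)); // untokenized duplicate: used for sorting
        writer.addDocument(doc);
        writer.close();

        IndexSearcher searcher = new IndexSearcher("/path/to/index");
        Query query = QueryParser.parse("lucene", "title", new StandardAnalyzer());
        Hits hits = searcher.search(query, new Sort("titleSort")); // sort on the keyword copy
        System.out.println(hits.length() + " hits, sorted by title");
        searcher.close();
    }
}
===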
Re: sorting tokenized field
I was only thinking in terms of other search engines. I have worked with other search engines and I didn't see this requirement before. I think you are right that it's wasteful to duplicate all tokenized fields. Not sure if there is a smart way of dealing with it. Praveen - Original Message - From: Erik Hatcher [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Friday, December 10, 2004 1:53 PM Subject: Re: sorting tokenized field On Dec 10, 2004, at 1:40 PM, Praveen Peddi wrote: I read that tokenized fields cannot be sorted. In order to sort on a tokenized field, the application either has to duplicate the field with a different name and not tokenize it, or come up with something else. But shouldn't the search engine take care of this? Are there any plans to build this functionality into Lucene? It would be wasteful for Lucene to assume any field you add should be available for sorting. Adding one more line to your indexing code to accommodate your sorting needs seems a pretty small price to pay. Do you have suggestions to improve how this works? Or how it is documented? Erik
Re: OutOfMemoryError with Lucene 1.4 final
Great!!! It works perfectly after I set the -Xms and -Xmx JVM command-line parameters: java -Xms128m -Xmx128m It turns out that my JVM was running out of memory. And Otis is right on my reader closing too: reader.close() will close the reader and release any system resources associated with it. I really appreciate everyone's help! Ying
No of docs using IndexSearcher
How do I get the number of docs in an index if I just have access to a searcher on that index? Thanks in advance Ravi.
Re: No of docs using IndexSearcher
numDocs() http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#numDocs() Ravi said the following on 12/10/2004 2:42 PM: How do I get the number of docs in an index if I just have access to a searcher on that index? Thanks in advance Ravi.
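A minimal sketch of the numDocs() answer, assuming a reader is opened directly on the same index the searcher was built from (the path is illustrative):
===
import org.apache.lucene.index.IndexReader;

public class DocCountExample {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/path/to/index"); // illustrative path
        System.out.println("Documents in index: " + reader.numDocs());
        reader.close();
    }
}
===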
Re: No of docs using IndexSearcher
If your index is open, shouldn't there be an instance of IndexReader already there? Ravi said the following on 12/10/2004 3:13 PM: I already have a field with a constant value in my index. How about using IndexSearcher.docFreq(new Term(field, value))? Then I don't have to instantiate IndexReader. -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Sent: Friday, December 10, 2004 2:59 PM To: Lucene Users List Subject: Re: No of docs using IndexSearcher numDocs() http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#numDocs() Ravi said the following on 12/10/2004 2:42 PM: How do I get the number of docs in an index if I just have access to a searcher on that index? Thanks in advance Ravi.
RE: No of docs using IndexSearcher
I'm fairly new to Lucene. The main reason why I didn't use the IndexReader constructor for the searcher is that we organize the indexes as different partitions depending on the document's date, and during searching I instantiate a MultiSearcher object on these different partitions depending on the from-date and to-date of the search. I was getting a runtime exception during search if the index does not have any documents. That's why I was looking for some method on the searcher object that gives me the number of documents. Thanks Ravi -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Sent: Friday, December 10, 2004 3:25 PM To: Lucene Users List Subject: Re: No of docs using IndexSearcher If your index is open, shouldn't there be an instance of IndexReader already there? Ravi said the following on 12/10/2004 3:13 PM: I already have a field with a constant value in my index. How about using IndexSearcher.docFreq(new Term(field, value))? Then I don't have to instantiate IndexReader. -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Sent: Friday, December 10, 2004 2:59 PM To: Lucene Users List Subject: Re: No of docs using IndexSearcher numDocs() http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#numDocs() Ravi said the following on 12/10/2004 2:42 PM: How do I get the number of docs in an index if I just have access to a searcher on that index? Thanks in advance Ravi.
Re: sorting tokenized field
Since I don't know the Lucene code much, I couldn't make much out of your patch. But has this patch already been tested and proved to be efficient? If so, why can't it be merged into the Lucene code and made part of the release? I think the bug is valid. It's very likely that people want to sort on tokenized fields. If I apply this patch to the Lucene code and use it for myself, I will have a hard time managing it in the future (while upgrading the Lucene library). If the patch is applied to the Lucene release code, it would be very easy for Lucene users. If possible, can someone explain what the patch does? I am trying to understand what exactly changed but could not figure it out. Praveen - Original Message - From: Aviran [EMAIL PROTECTED] To: 'Lucene Users List' [EMAIL PROTECTED] Sent: Friday, December 10, 2004 2:30 PM Subject: RE: sorting tokenized field I have suggested a solution for this problem ( http://issues.apache.org/bugzilla/show_bug.cgi?id=30382 ) - you can use the patch suggested there and recompile Lucene. Aviran http://www.aviransplace.com -Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: Friday, December 10, 2004 13:53 PM To: Lucene Users List Subject: Re: sorting tokenized field On Dec 10, 2004, at 1:40 PM, Praveen Peddi wrote: I read that tokenized fields cannot be sorted. In order to sort on a tokenized field, the application either has to duplicate the field with a different name and not tokenize it, or come up with something else. But shouldn't the search engine take care of this? Are there any plans to build this functionality into Lucene? It would be wasteful for Lucene to assume any field you add should be available for sorting. Adding one more line to your indexing code to accommodate your sorting needs seems a pretty small price to pay. Do you have suggestions to improve how this works? Or how it is documented? Erik
Sorting based on calculations at search time
Hello, I'd like some suggestions on the following scenario. Say I have an index with a stored, indexed field called 'weight' (essentially an int stored as a string). I'd like to sort the search results in descending order of a final weight computed from the Lucene score of each hit. For our discussion, the calculation can be as simple as multiplying the Lucene score by the value of the 'weight' field to get the final weight. The search results can run into thousands of documents. Though in the end I may need only the top X documents, I wouldn't know what the top X are until I perform this calculation and sort. The obvious way is to post-process the hits iterator: store the hits in memory, perform the calculation, and sort. Is there any better solution for this? Thanks, Guru. * Gurukeerthi Gurunathan Third Pillar Systems San Mateo, CA 650-372-1200x229
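A sketch of the "obvious way" Guru describes, as a baseline - re-score each hit as Lucene score times the stored 'weight' value, sort descending, and keep the top X; the class name, query, and index path are illustrative:
===
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class WeightedRescore {
    static class Scored {
        int docId;
        float finalWeight;
        Scored(int docId, float finalWeight) { this.docId = docId; this.finalWeight = finalWeight; }
    }

    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("/path/to/index"); // illustrative
        Query query = QueryParser.parse("camera", "contents", new StandardAnalyzer());
        Hits hits = searcher.search(query);

        // hits.doc(i) loads every matching document, which is the cost this
        // thread is trying to avoid; this is just the baseline approach.
        List rescored = new ArrayList();
        for (int i = 0; i < hits.length(); i++) {
            float weight = Float.parseFloat(hits.doc(i).get("weight")); // stored int-as-string
            rescored.add(new Scored(hits.id(i), hits.score(i) * weight));
        }
        // Sort by the computed final weight, descending.
        Collections.sort(rescored, new Comparator() {
            public int compare(Object a, Object b) {
                return Float.compare(((Scored) b).finalWeight, ((Scored) a).finalWeight);
            }
        });

        int topX = Math.min(10, rescored.size());
        for (int i = 0; i < topX; i++) {
            Scored s = (Scored) rescored.get(i);
            System.out.println("doc " + s.docId + " finalWeight=" + s.finalWeight);
        }
        searcher.close();
    }
}
===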
Re: Lucene in Action e-book now available!
Congratulations on the book. I ordered my copy the other day via regular post and am eagerly awaiting it. It looks like it will make Lucene available to a much wider audience. Based on the table of contents, I wanted to toss out a couple of ideas for your next book or articles. 1. I didn't see any examples of indexing a database table, although it was mentioned in Chapter 1. At the company I am currently consulting at, we index the data from the database because it's cleaner than indexing the web. This discussion should include why you would want to use Lucene to index a database table rather than just using the database indexes. (The top reasons we chose to use Lucene instead of just database indexes are: it allows stem word recognition; it allows fuzzy searching; it ranks the results based on how good the match is; it contains a parser that will parse natural language queries; it has better Analyzers.) 2. This one is a cookbook idea: I think it would be possible to index the access log of a web server. Then, when a user views product X, the searcher could search for other products that were viewed by people who also looked at product X. In this way you can create basic cross-selling opportunities. This feature is a big seller to managers for commercial search offerings. 3. A lot of search applications being built using Lucene are web applications. I didn't see any reference to the two different strategies for paging a hit list. The two strategies are repeating the search and caching a search. An example of this would be good. [I know that I have seen this online, it's just nice to have a reference in book form] Please don't take this as criticism, first of all because I have not read the book, and secondly because I am excluding the other 17 topics that I thought should be in a book (for example, indexing PDFs, highlighting search results, creating a thesaurus, suggesting alternative spellings, filtering by ACLs, etc...) because they are clearly in your table of contents. I look forward to reading the book and appreciate your 14+ months of hard work to create a concise but valuable book for Lucene. Jonathan On Fri, 10 Dec 2004 03:52:55 -0500, Erik Hatcher [EMAIL PROTECTED] wrote: The Lucene in Action e-book is now available at Manning's site: http://www.manning.com/hatcher2 Manning also put lots of other goodies there: the table of contents, about this book, the preface, the foreword from Doug Cutting himself (thanks Doug!!!), and a couple of sample chapters. The complete source code is there as well. Now comes the exciting part: finding out what others think of the work Otis and I spent 14+ months of our lives on. Erik
RE: Sorting based on calculations at search time
Thanks Otis for your response and compliments (wish I was a lucene guru like you guys :-) I believe you are talking about the boost factor for fields or documents while searching. That does not apply in my case - maybe I am missing a point here. The weight field I was talking about is only for the calculation purpose, I am not searching on that field (it can be just a stored, unindexed field). The main searching happens on other fields(like title, keywords etc.) for which I am already using some boost factor. The problem starts after I search and got some set of results - all I want here is the result to be ordered by a number that is a multiplication of lucene score and the weight field value for each document. I understand that without iterating thru the hits I cannot retrieve the score and the weight for each document - which is why I'd like this calculation and ordering to happen while searching so that I can avoid the iteration over the entire hits. If it involves working on the lucene source code, please point me to the right class or package that I should be dealing with. Thanks again, Guru. -Original Message- From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] Sent: Friday, December 10, 2004 3:13 PM To: Lucene Users List Subject: Re: Sorting based on calculations at search time Guru (I thought my first name was OK until now), Have you tried using boosts for that? You can boost individual Document Fields when indexing, and/or you can boost individual Documents, thus giving some more and some less 'weight', which will have an effect on the final score. Otis --- Gurukeerthi Gurunathan [EMAIL PROTECTED] wrote: Hello, I'd like some suggestions on the following scenario. Say I have an index with a stored, indexed field called 'weight'(essentially an int stored as string). I'd like to sort in descending order of final weight, the search results by performing a calculation involving the lucene score for each hits. For our discussion, the calculation can be as simple as multiplying the lucene score with the value from the field 'weight' to get final weight. The search results can run into thousands of documents. Though finally I may need only the top X number of documents, I wouldn't know what the top X would be until I perform this calculation and sort it. The obvious way is to do a post processing of the hits iterator, storing it in memory, performing this calculation and sorting it. Is there any other better solution for this? Thanks, Guru. * Gurukeerthi Gurunathan Third Pillar Systems San Mateo, CA 650-372-1200x229 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
A simple Query Language
Hi, I am going to implement a search service and plan to use Lucene. Is there any simple query language that is independent of any particular search engine out there? Thanks Dongling
RE: A simple Query Language
You could support only terms with no operators at all, which will work in most search engines (except those that require combining operators). Using just terms, plus phrases embedded in quotes, is pretty universal. After that, you might want to add +/- required/prohibited restrictions, which many engines support. After that, I think you're getting pretty specific. Lucene supports all of these and many more. Chuck -Original Message- From: Dongling Ding [mailto:[EMAIL PROTECTED] Sent: Friday, December 10, 2004 5:08 PM To: Lucene Users List Subject: A simple Query Language Hi, I am going to implement a search service and plan to use Lucene. Is there any simple query language that is independent of any particular search engine out there? Thanks Dongling
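A minimal sketch of parsing that lowest-common-denominator syntax (bare terms, quoted phrases, and +/- restrictions) with Lucene's QueryParser; the field name and query string are illustrative:
===
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class SimpleSyntaxExample {
    public static void main(String[] args) throws Exception {
        // Bare terms, a quoted phrase, a required (+) term and a prohibited (-) term.
        String userInput = "+lucene \"query language\" -perl search";
        Query query = QueryParser.parse(userInput, "contents", new StandardAnalyzer());
        System.out.println("Parsed query: " + query.toString("contents"));
    }
}
===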
Re: Incremental Search experiment with Lucene, sort of like the new Google Suggestion page
Very cool, thanks for posting this! Google's feature doesn't seem to do a search on every keystroke necessarily. Instead, it waits until you haven't typed a character for a short period (I'm guessing about 100 or 150 milliseconds). So if you type fast, it doesn't hit the server until you pause. There are some more detailed postings on slashdot about how it works. On Fri, 10 Dec 2004 16:36:27 -0800, David Spencer [EMAIL PROTECTED] wrote: Google just came out with a page that gives you feedback as to how many pages will match your query and variations on it: http://www.google.com/webhp?complete=1hl=en I had an unexposed experiment I had done with Lucene a few months ago that this has inspired me to expose - it's not the same, but it's similar in that as you type in a query you're given *immediate* feedback as to how many pages match. Try it here: http://www.searchmorph.com/kat/isearch.html This is my SearchMorph site which has an index of ~90k pages of open source javadoc packages. As you type in a query, on every keystroke it does at least one Lucene search to show results in the bottom part of the page. It also gives spelling corrections (using my NGramSpeller contribution) and also suggests popular tokens that start the same way as your search query. For one way to see corrections in action, type in rollback character by character (don't do a cut and paste). Note that: -- this is not how the Google page works - just similar to it -- I do single word suggestions while google does the more useful whole phrase suggestions (TBD I'll try to copy them) -- They do lots of javascript magic, whereas I use old school frames mostly -- this is relatively expensive, as it does 1 query per character, and when it's doing spelling correction there is even more work going on -- this is just an experiment and the page may be unstable as I fool w/ it What's nice is when you get used to immediate results, going back to the batch way of searching seems backward, slow, and old fashioned. There are too many idle CPUs in the world - this is one way to keep them busier :) -- Dave PS Weblog entry updated too: http://www.searchmorph.com/weblog/index.php?id=26 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Sorting based on calculations at search time
: I believe you are talking about the boost factor for fields or documents : while searching. That does not apply in my case - maybe I am missing a : point here. : The weight field I was talking about is only for the calculation Otis is suggesting that you set the boost of the document to be your weight value. That way Lucene will automatically do your multiplication calculation when determining the score. The downside of this is that I don't think there's any way to keep it from influencing the score on every search, so it's not something you could use only on some queries. -Hoss
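A sketch of the index-time boost Otis and Hoss describe - set the document boost to the weight value while indexing, so the multiplication is folded into every score Lucene computes for that document; the weight source and field names are illustrative:
===
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class BoostByWeight {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);

        float weight = 3.0f; // illustrative: would come from your own per-document data
        Document doc = new Document();
        doc.add(Field.Text("title", "digital camera buying guide"));
        doc.add(Field.UnIndexed("weight", String.valueOf(weight))); // keep the raw value too
        doc.setBoost(weight); // folded into the score of every query that matches this doc

        writer.addDocument(doc);
        writer.close();
    }
}
===
Note that index-time boosts are folded into the field norms and stored with limited precision, so the effect is an approximation rather than an exact multiplication, and it applies to every search, as Hoss points out.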
Re: SEARCH +HITS+LIMIT
Displaytag (http://displaytag.sourceforge.net/) is for displaying search results in multiple pages. lp, a Karthik N S wrote: Hi Guys, Apologies... One question for the forum [ especially Erik ]: 1) I have a MERGED index with 100,000 files indexed into it (Content is one of the fields, of type 'Text'). 2) A search for a simple word such as Camera returns 6000 hits. 3) Since the search process is via a WebApp, a simple JSP is used to display the content. Question: how do I display the contents for the hits in incremental order [each time re-hitting the merged index with an incremented X value]? This would avoid the Out of Memory problem caused by prefetching all the hits in one straight-through pass. Ex: Total hits 6000 1st page - hits returned (1 to 25) 2nd page - hits returned (26 to 50) . . . . Nth page - hits returned (5975 to 6000) Hint: this is similar to the SQL query SELECT * FROM LUCENE LIMIT 10, 5 WITH WARM REGARDS HAVE A NICE DAY [ N.S.KARTHIK]
RE: OutOfMemoryError with Lucene 1.4 final
I think you should close your files in a finally clause in case of exceptions with the file system, and also print out the exception. You could be running out of file handles. -Original Message- From: Jin, Ying [mailto:[EMAIL PROTECTED] Sent: Friday, December 10, 2004 11:15 AM To: [EMAIL PROTECTED] Subject: OutOfMemoryError with Lucene 1.4 final Hi, Everyone, We're trying to index ~1500 archives but get OutOfMemoryError about halfway through the index process. I've tried to run the program under two different Redhat Linux servers: One with 256M memory and 365M swap space. The other one with 512M memory and 1G swap space. However, both got OutOfMemoryError at the same place (at record 898). Here is my code for indexing: === Document doc = new Document(); doc.add(Field.UnIndexed("path", f.getPath())); doc.add(Field.Keyword("modified", DateField.timeToString(f.lastModified()))); doc.add(Field.UnIndexed("eprintid", id)); doc.add(Field.Text("metadata", metadata)); FileInputStream is = new FileInputStream(f); // the text file BufferedReader reader = new BufferedReader(new InputStreamReader(is)); StringBuffer stringBuffer = new StringBuffer(); String line = ""; try { while ((line = reader.readLine()) != null) { stringBuffer.append(line); } doc.add(Field.Text("contents", stringBuffer.toString())); // release the resources is.close(); reader.close(); } catch (java.io.IOException e) {} === Is there anything wrong with my code or do I need more memory? Thanks for any help! Ying
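A sketch of the original snippet restructured the way Sildy suggests - close the reader in a finally clause and let the IOException propagate so it gets reported instead of swallowed; the surrounding variables (f, id, metadata) are assumed from the original post:
===
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.lucene.document.DateField;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class ArchiveDocumentBuilder {
    static Document buildDocument(File f, String id, String metadata) throws IOException {
        Document doc = new Document();
        doc.add(Field.UnIndexed("path", f.getPath()));
        doc.add(Field.Keyword("modified", DateField.timeToString(f.lastModified())));
        doc.add(Field.UnIndexed("eprintid", id));
        doc.add(Field.Text("metadata", metadata));

        BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(f)));
        try {
            StringBuffer stringBuffer = new StringBuffer();
            String line;
            while ((line = reader.readLine()) != null) {
                stringBuffer.append(line);
            }
            doc.add(Field.Text("contents", stringBuffer.toString()));
        } finally {
            reader.close(); // closing the reader also closes the underlying stream
        }
        return doc;
    }
}
===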
OutOfMemoryError with Lucene 1.4 final
Hi, Everyone, We're trying to index ~1500 archives but get OutOfMemoryError about halfway through the index process. I've tried to run the program under two different Redhat Linux servers: One with 256M memory and 365M swap space. The other one with 512M memory and 1G swap space. However, both got OutOfMemoryError at the same place (at record 898). Here is my code for indexing: === Document doc = new Document(); doc.add(Field.UnIndexed("path", f.getPath())); doc.add(Field.Keyword("modified", DateField.timeToString(f.lastModified()))); doc.add(Field.UnIndexed("eprintid", id)); doc.add(Field.Text("metadata", metadata)); FileInputStream is = new FileInputStream(f); // the text file BufferedReader reader = new BufferedReader(new InputStreamReader(is)); StringBuffer stringBuffer = new StringBuffer(); String line = ""; try { while ((line = reader.readLine()) != null) { stringBuffer.append(line); } doc.add(Field.Text("contents", stringBuffer.toString())); // release the resources is.close(); reader.close(); } catch (java.io.IOException e) {} === Is there anything wrong with my code or do I need more memory? Thanks for any help! Ying
RE: OutOfMemoryError with Lucene 1.4 final
Ying, You should follow this finally block advice below. In addition, I think you can just close the reader, and it will close the underlying stream (I'm not sure about that, double-check it). You are not running out of file handles, though. Your JVM is running out of memory. You can play with: 1) -Xms and -Xmx JVM command-line parameters 2) IndexWriter's parameters: mergeFactor and minMergeDocs - check the Javadocs for more info. They will let you control how much memory your indexing process uses. Otis --- Sildy Augustine [EMAIL PROTECTED] wrote: I think you should close your files in a finally clause in case of exceptions with file system and also print out the exception. You could be running out of file handles. -Original Message- From: Jin, Ying [mailto:[EMAIL PROTECTED] Sent: Friday, December 10, 2004 11:15 AM To: [EMAIL PROTECTED] Subject: OutOfMemoryError with Lucene 1.4 final Hi, Everyone, We're trying to index ~1500 archives but get OutOfMemoryError about halfway through the index process. I've tried to run program under two different Redhat Linux servers: One with 256M memory and 365M swap space. The other one with 512M memory and 1G swap space. However, both got OutOfMemoryError at the same place (at record 898). Here is my code for indexing: === Document doc = new Document(); doc.add(Field.UnIndexed(path, f.getPath())); doc.add(Field.Keyword(modified, DateField.timeToString(f.lastModified(; doc.add(Field.UnIndexed(eprintid, id)); doc.add(Field.Text(metadata, metadata)); FileInputStream is = new FileInputStream(f); // the text file BufferedReader reader = new BufferedReader(new InputStreamReader(is)); StringBuffer stringBuffer = new StringBuffer(); String line = ; try{ while((line = reader.readLine()) != null){ stringBuffer.append(line); } doc.add(Field.Text(contents, stringBuffer.toString())); // release the resources is.close(); reader.close(); }catch(java.io.IOException e){} = Is there anything wrong with my code or I need more memory? Thanks for any help! Ying - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
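A sketch of the second knob Otis mentions; in the Lucene 1.4-era API the merge settings are public fields on IndexWriter, and the values below are only illustrative starting points:
===
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class TunedIndexing {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);

        // minMergeDocs is how many documents are buffered in RAM before a new
        // on-disk segment is written: raising it speeds up indexing but uses
        // more memory, so keep it small if you are hitting OutOfMemoryError.
        writer.minMergeDocs = 10;
        // mergeFactor controls how many segments are merged at once; higher
        // values also trade memory and open files for indexing speed.
        writer.mergeFactor = 10;

        // ... addDocument() calls go here ...

        writer.optimize();
        writer.close();
    }
}
===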
Re: OutOfMemoryError with Lucene 1.4 final
Ok, I see. Seems most ppl think is the third possiblity On Fri, 10 Dec 2004, Xiangyu Jin wrote: I am not sure. But guess there are three possilities, (1). see that you use Field.Text(contents, stringBuffer.toString()) This will store all your string of text into document object. And it might be long ... I do not know the detail how Lucene implemented. I think you can try use unstored first to see if the same problem happen. BTW, how large is your document. Mine has 1M docs and max-length less than 1 M, usually has length about several k. (2) I guess another possiblilty is that record 898 is a very long document, maybe java' s string object has a maxlength? Just trace the code, see when the exception occur. (3) Moreover, if you run it on a java VM, it also has a setting of its virtual mem. It has nothing to do with the hardware you are running. I has met this before when I use the directory's ListOfFile function, where it easily exceed the max mem, if there are 1M docs under the same dir (a stupid mistake I made). But if I expand the VM's mem, it is then appears ok. :) On Fri, 10 Dec 2004, Jin, Ying wrote: Hi, Everyone, We're trying to index ~1500 archives but get OutOfMemoryError about halfway through the index process. I've tried to run program under two different Redhat Linux servers: One with 256M memory and 365M swap space. The other one with 512M memory and 1G swap space. However, both got OutOfMemoryError at the same place (at record 898). Here is my code for indexing: === Document doc = new Document(); doc.add(Field.UnIndexed(path, f.getPath())); doc.add(Field.Keyword(modified, DateField.timeToString(f.lastModified(; doc.add(Field.UnIndexed(eprintid, id)); doc.add(Field.Text(metadata, metadata)); FileInputStream is = new FileInputStream(f); // the text file BufferedReader reader = new BufferedReader(new InputStreamReader(is)); StringBuffer stringBuffer = new StringBuffer(); String line = ; try{ while((line = reader.readLine()) != null){ stringBuffer.append(line); } doc.add(Field.Text(contents, stringBuffer.toString())); // release the resources is.close(); reader.close(); }catch(java.io.IOException e){} = Is there anything wrong with my code or I need more memory? Thanks for any help! Ying - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: sorting tokenized field
I have suggested a solution for this problem ( http://issues.apache.org/bugzilla/show_bug.cgi?id=30382 ) you can use the patch suggested there and recompile lucene. Aviran http://www.aviransplace.com -Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: Friday, December 10, 2004 13:53 PM To: Lucene Users List Subject: Re: sorting tokenized field On Dec 10, 2004, at 1:40 PM, Praveen Peddi wrote: I read that the tokenised fields cannot be sorted. In order to sort tokenized field, either the application has to duplicate field with diff name and not tokenize it or come up with something else. But shouldn't the search engine takecare of this? Are there any plans of putting this functionality built into lucene? It would be wasteful for Lucene to assume any field you add should be available for sorting. Adding one more line to your indexing code to accommodate your sorting needs seems a pretty small price to pay. Do you have suggestions to improve how this works? Or how it is documented? Erik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
MultiSearcher close
If I close a MultiSearcher, does it close all the associated searchers too? I was getting a bad file descriptor error if I close the MultiSearcher object and open it again for another search without reinstantiating the underlying searchers. Thanks in advance, Ravi
Re: MultiSearcher close
On Dec 10, 2004, at 4:16 PM, Ravi wrote: If I close a MultiSearcher, does it close all the associated searchers too? It sure does: public void close() throws IOException { for (int i = 0; i < searchables.length; i++) searchables[i].close(); } I was getting a bad file descriptor error if I close the MultiSearcher object and open it again for another search without reinstantiating the underlying searchers. Thanks in advance, Ravi
Incremental Search experiment with Lucene, sort of like the new Google Suggestion page
Google just came out with a page that gives you feedback as to how many pages will match your query and variations on it: http://www.google.com/webhp?complete=1&hl=en I had an unexposed experiment I had done with Lucene a few months ago that this has inspired me to expose - it's not the same, but it's similar in that as you type in a query you're given *immediate* feedback as to how many pages match. Try it here: http://www.searchmorph.com/kat/isearch.html This is my SearchMorph site which has an index of ~90k pages of open source javadoc packages. As you type in a query, on every keystroke it does at least one Lucene search to show results in the bottom part of the page. It also gives spelling corrections (using my NGramSpeller contribution) and also suggests popular tokens that start the same way as your search query. For one way to see corrections in action, type in "rollback" character by character (don't do a cut and paste). Note that: -- this is not how the Google page works - just similar to it -- I do single word suggestions while Google does the more useful whole phrase suggestions (TBD I'll try to copy them) -- They do lots of javascript magic, whereas I use old school frames mostly -- this is relatively expensive, as it does 1 query per character, and when it's doing spelling correction there is even more work going on -- this is just an experiment and the page may be unstable as I fool w/ it What's nice is when you get used to immediate results, going back to the batch way of searching seems backward, slow, and old fashioned. There are too many idle CPUs in the world - this is one way to keep them busier :) -- Dave PS Weblog entry updated too: http://www.searchmorph.com/weblog/index.php?id=26