Re: range and content query
Chris Fraschetti writes: can someone assist me in building, or deny the possibility of, combining a range query and a standard query? Say for instance I have two fields I'm searching on: one a field with an epoch date associated with the entry, and the other the content. How can I make a query that selects a range of those epochs and also searches through the content? Can it be done in one query, or do I have to perform a query upon a query, and if so, what might the syntax look like?

If you create the query using the API, use a boolean query to combine the two basic queries. If you use the query parser, use AND or OR. Note that range queries are expanded into boolean queries (OR-combined), which may be a problem if the number of terms matching the range is too big; it depends on your date entries and especially on how precise they are. Alternatively, you might consider using a filter.

HTH
Morus

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
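Combining the two through the API, as Morus suggests, might look roughly like this. This is only a sketch against the Lucene 1.4-era API; the field names ("date", "contents"), the epoch values, and the term "lucene" are made up for illustration and are not from the thread:

```
// Sketch: AND a date-range query together with a content term query.
// Field names and values below are illustrative only.
RangeQuery range = new RangeQuery(
        new Term("date", "1095638400"),   // lower epoch bound
        new Term("date", "1095724800"),   // upper epoch bound
        true);                            // inclusive bounds
TermQuery content = new TermQuery(new Term("contents", "lucene"));

BooleanQuery combined = new BooleanQuery();
combined.add(range, true, false);    // required, not prohibited
combined.add(content, true, false);  // required, not prohibited

Hits hits = searcher.search(combined);
```

The query-parser equivalent would be something like `date:[1095638400 TO 1095724800] AND contents:lucene`. Note that epoch strings like these only range correctly because they all have the same number of digits: as discussed later in this thread, Lucene compares terms as strings, not numbers.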
Re: indexes won't close on windows
Hi Fred,

I think that we can help you if you provide us your code, and the context in which it is used. We need to see how you open and close the searcher and the reader, and what operations you are doing on the index.

All the best,
Sergiu

Fred Toth wrote:

Hi, I have built a nice Lucene application on linux with no problems, but when I ported to windows for the customer, I started experiencing problems with the index not closing. This prevents re-indexing. I'm using lucene 1.4.1 under tomcat 5.0.28. My search operation is very simple and works great: create reader, create searcher, do search, extract N docs from hits, close searcher, close reader. However, on several occasions, when trying to re-index, I get "can't delete file" errors from the indexer. I discovered that restarting tomcat clears the problem. (Note that I'm recreating the index completely, not updating.) I've spent the last couple of hours trolling the archives and I've found numerous references to windows problems with open files. Is there a fix for this? How can I force the files to close? What's the best work-around?

Many thanks,
Fred
Re: range and content query
I've more or less figured out the query string required to get a range of docs, say date:[0 TO 10], assuming my dates are from 0 to 10 (for the sake of this example). My query has results that I don't understand: if I do 0 TO 10, then I only get results matching 0, 1, 10; if I do 0 TO 8, I get all results from 0 to 10; if I do 1 TO 5, then I get results 1, 2, 3, 4, 5, 10. Very strange. Here is how my query looks:

query: +date_field:[1 TO 5]

Here is how the date was added:

Document doc = new Document();
doc.add(Field.UnIndexed(arcpath_field, filename));
doc.add(Field.Keyword(date_field, date));
doc.add(Field.Text(content_field, content));
writer.addDocument(doc);

I tried Field.Text for the date and also received the same results. Essentially I have a loop to add 11 strings, indexes 0 to 10, and add doc0, 0, some text for each, and the results I get are as explained above. Any ideas?

Here is my simple searching code. I'm currently not searching for any text; I just want to test the range feature right now:

query_string = "+(" + DATE_FIELD + ":[" + start_date + " TO " + end_date + "])";
Searcher searcher = new IndexSearcher(index_path);
QueryParser parser = new QueryParser(CONTENT_FIELD, new StandardAnalyzer());
parser.setOperator(QueryParser.DEFAULT_OPERATOR_OR);
Query query = parser.parse(query_string);
System.out.println("query: " + query.toString());
Hits hits = searcher.search(query);

On Mon, 20 Sep 2004 08:24:17 +0200, Morus Walter [EMAIL PROTECTED] wrote: [quoted message trimmed]

--
___ Chris Fraschetti, Student CompSci System Admin University of San Francisco
e [EMAIL PROTECTED] | http://meteora.cs.usfca.edu
Re: range and content query
Chris Fraschetti writes: [...] if I do 0 TO 10, then I only get results matching 0, 1, 10; if I do 0 TO 8, I get all results from 0 to 10; if I do 1 TO 5, then I get results 1, 2, 3, 4, 5, 10. Very strange.

That's not strange. Lucene indexes strings and compares strings, not numbers. So the order is 1, 10, 101, 11, 2, 20, 21, 3, 4 and so on. It's up to you to format your numbers in a way that will work, e.g. use leading '0's to get 001, 002, 003, 004, 010, 011, 020, 021, ... I think there's a page in the wiki about these issues.

[quoted code trimmed]

It's bad practice to create search strings that have to be decomposed by the query parser again if you have the parts already at hand, at least in most cases. I don't know the details of how and when the query parser will call the analyzer, or what StandardAnalyzer does with numbers. What does query.toString() output? But the main problem seems to be your misunderstanding of searching numbers in Lucene: they are just strings and are treated by their lexical representation, not their numeric value.

Morus
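Morus's point about lexical order can be seen with plain string sorting, independent of Lucene. A minimal self-contained illustration (the `pad` helper and the sample values are mine, not from the thread):

```java
import java.util.Arrays;

public class PaddedSort {
    // Left-pad a non-negative number with zeros so that string order
    // matches numeric order.
    static String pad(int n, int width) {
        return String.format("%0" + width + "d", n);
    }

    public static void main(String[] args) {
        String[] raw = {"1", "10", "2", "21", "3"};
        Arrays.sort(raw);
        // Lexicographic order: [1, 10, 2, 21, 3] -- "10" sorts before "2",
        // which is exactly why the range [1 TO 5] also matched "10".
        System.out.println(Arrays.toString(raw));

        String[] padded = {"001", "010", "002", "021", "003"};
        Arrays.sort(padded);
        // Padded order: [001, 002, 003, 010, 021] -- string order now
        // agrees with numeric order, so range queries behave as expected.
        System.out.println(Arrays.toString(padded));
    }
}
```

The same trick applies to epoch dates: as long as every value is padded to the same width, term ranges work.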
Re: range and content query
Very correct you are. Changing the format of the numbers when I index them and when I do the range query fixed my problem. Thanks much.

On Mon, 20 Sep 2004 09:08:50 +0200, Morus Walter [EMAIL PROTECTED] wrote: [quoted message trimmed]

--
___ Chris Fraschetti, Student CompSci System Admin University of San Francisco
e [EMAIL PROTECTED] | http://meteora.cs.usfca.edu
Re: indexes won't close on windows
Hi Sergiu,

My searches take place in tomcat, in a struts action, in a single method. Abbreviated code:

IndexReader reader = null;
IndexSearcher searcher = null;
reader = IndexReader.open(indexName);
searcher = new IndexSearcher(reader);
// code to do a search and extract hits, works fine.
searcher.close();
reader.close();

I have a command-line indexer that is a minor modification of the IndexHTML.java that comes with Lucene. It does this:

writer = new IndexWriter(index, new StandardAnalyzer(), create);
// add docs (with the create flag set true).

It is here that I get a failure, "can't delete _b9.cfs" or similar. This happens when tomcat is completely idle (we're still testing and not live), so all readers and searchers should be closed, at least as far as Java is concerned. But Windows will not allow the indexer to delete the old index. I restarted tomcat and the problem cleared. It's as if the JVM on Windows doesn't get the file closes quite right. I've seen numerous references on this list to similar behavior, but it's not clear what the fix might be.

Many thanks,
Fred

At 02:32 AM 9/20/2004, you wrote: [quoted message trimmed]
Re: indexes won't close on windows
Fred,

I won't get into the details here, but you shouldn't (have to) open a new IndexReader/Searcher on each request (I'll assume the code below is from your Action's execute method). You should cache and re-use IndexReaders (and IndexSearchers). There may be a FAQ entry regarding that, I'm not sure. Closing them on every request is also something you shouldn't do (opening and closing them is, in simple terms, just doing too much work: open N files, read them, close them; open N files, read them, close them; and so on).

Regarding the failing deletion, that's a Windows OS thing - it won't let you remove a file while another process has it open. I am not certain where exactly this error comes from in Lucene (exception stack trace?), but I thought the Lucene code included work-arounds for this.

Otis

--- Fred Toth [EMAIL PROTECTED] wrote: [quoted message trimmed]
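The caching pattern Otis describes might look roughly like the sketch below. This is against the Lucene 1.4-era API; the class name, field names, and the reset policy are my own illustration, not from the thread:

```
// Sketch: share one IndexSearcher across requests instead of opening
// and closing a reader per search. Names here are illustrative.
public class SearcherHolder {
    private static IndexSearcher searcher;

    public static synchronized IndexSearcher get(String indexPath)
            throws IOException {
        if (searcher == null) {
            searcher = new IndexSearcher(indexPath);
        }
        return searcher;
    }

    // Call only after the index has been rebuilt, so the next get()
    // opens a fresh view of the new index.
    public static synchronized void reset() throws IOException {
        if (searcher != null) {
            searcher.close();
            searcher = null;
        }
    }
}
```

Each request then calls `SearcherHolder.get(indexPath)` instead of constructing its own reader, which avoids the repeated open/close cycles Otis warns about and keeps at most one set of index files open in the servlet container.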
Re: indexes won't close on windows
Hi Fred,

That's right, there are many references to this kind of problem in the lucene-user list. These suggestions were already made, but I'll list them once again:

1. One way to use the IndexSearcher is your code, but I don't encourage users to do that:

IndexReader reader = null;
IndexSearcher searcher = null;
reader = IndexReader.open(indexName);
searcher = new IndexSearcher(reader);

It's better to use the constructor that takes a String path, IndexSearcher(String path). I even suggest that the path be obtained as:

File indexFolder = new File(luceneIndex);
IndexSearcher searcher = new IndexSearcher(indexFolder.toString());

2. I can imagine situations when the Lucene index must be created at each startup, but I think that this is very rare, so I suggest using code like:

if (indexExists(indexFolder))
    writer = new IndexWriter(index, new StandardAnalyzer(), false);
else
    writer = new IndexWriter(index, new StandardAnalyzer(), true);
// don't forget to close the IndexWriter when you create the index, and to open it again

I use an indexExists function like:

boolean indexExists(File indexFolder) { return indexFolder.exists(); }

and it works properly, even if that's not the best example of testing the existence of the index.

3. "It is here that I get a failure, can't delete _b9.cfs" - that's probably because of the way you use the searcher, and probably because you don't close the readers, writers and searchers properly.

4. Be sure that all close() methods are guarded with catch (Exception e) { logger.log(e); } blocks.

5. Pay attention if you use a multithreading environment; in that case you have to make indexing, deletion and search synchronized.

So ... Have fun,
Sergiu

PS: I think that I'll submit some code with synchronized index/delete/search operations and explain why I need to use it.

Fred Toth wrote: [quoted message trimmed]
Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene
David Spencer writes: could you put the current version of your code on that website as a java weblog entry?

Updated: http://searchmorph.com/weblog/index.php?id=23

Thanks. Great suggestion, and thanks for that idiom - I should know such things by now. To clarify the issue, it's just a performance one, not other functionality. Anyway, I put in the code, and to be scientific I benchmarked it two times before the change and two times after - and the results were surprisingly the same both times (1:45 to 1:50 with an index that takes up 200MB). Probably there are cases where this will run faster, and the code seems more correct now, so it's in.

Ahh, I see, you check the field later. The logging made me think you index all fields you loop over, in which case one might get unwanted words into the ngram index.

An interesting application of this might be an ngram-index-enhanced version of the FuzzyQuery. While this introduces more complexity on the indexing side, it might be a large speedup for fuzzy searches. I was also thinking of reviewing the list to see if anyone had done a Jaro-Winkler fuzzy query yet, and doing that.

I went in another direction, and changed the ngram index and search to use a similarity that computes m * m / (n1 * n2), where m is the number of matches, n1 is the number of ngrams in the query and n2 is the number of ngrams in the word. (At least if I got that right; I'm not sure I understand all parts of the Similarity class correctly.) After removing the document boost in the ngram index based on the word frequency in the original index, I find the results pretty good. My data is a number of encyclopedias and dictionaries, and I only use the headwords for the ngram index. Term frequency doesn't seem relevant in this case. I still use the Levenshtein distance to modify the score and sort according to score / distance, but in most cases this does not make a difference, so I'll probably drop the distance calculation completely.

I also see little difference between using 2- and 3-grams on the one hand and only using 2-grams on the other, so I'll presumably drop the 3-grams. I'm not sure if the similarity I use is useful in general, but I attached it to this message in case someone is interested. Note that you need to set the similarity for both the index writer and the searcher, and thus have to reindex in case you want to give it a try.

Morus
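The overlap score Morus describes (m * m / (n1 * n2)) can be illustrated standalone, outside of Lucene's Similarity machinery. A self-contained sketch; the bigram extraction and the word pair below are my own illustration, not taken from his attached class:

```java
import java.util.HashSet;
import java.util.Set;

public class NgramOverlap {
    // Extract the set of character bigrams (2-grams) from a word.
    static Set<String> bigrams(String word) {
        Set<String> grams = new HashSet<String>();
        for (int i = 0; i + 2 <= word.length(); i++) {
            grams.add(word.substring(i, i + 2));
        }
        return grams;
    }

    // Score two words as m * m / (n1 * n2), where m is the number of
    // shared bigrams, n1 and n2 the bigram counts of each word.
    static double score(String query, String candidate) {
        Set<String> q = bigrams(query);
        Set<String> c = bigrams(candidate);
        int n1 = q.size();
        int n2 = c.size();
        q.retainAll(c);              // q now holds only the matches
        int m = q.size();
        return (double) (m * m) / (n1 * n2);
    }

    public static void main(String[] args) {
        // "lucene" vs "lucine" share the bigrams lu, uc, ne: 3*3/(5*5)
        System.out.println(NgramOverlap.score("lucene", "lucine"));
        // Identical words score 1.0
        System.out.println(NgramOverlap.score("lucene", "lucene"));
    }
}
```

A word identical to the query scores 1.0, and the score falls off with both missing and extra ngrams, which matches the intent of dividing by both n1 and n2.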
Re: indexes won't close on windows
Hi Otis,

I understand about reusing readers and searchers, but I was working on the "do the simplest thing that can possibly work" theory for starters, in part because I wanted to be sure that I could recreate the index safely as needed. I should emphasize that I developed for weeks on linux without ever seeing this problem, but in less than 24 hours after installing on the customer's windows box, I hit the error.

So is a close() not really a close()? Is Lucene actually hanging on to open files? Or is this a JVM-on-Windows bug? (I'm using the latest 1.4.2 from Sun.) As I mentioned, this has turned up off and on in the mail archives. Is there no well-understood fix or work-around? I'll get a stack trace set up for the next time it happens.

Thanks,
Fred

At 08:35 AM 9/20/2004, you wrote: [quoted message trimmed]
Re: indexes won't close on windows
Hi Sergiu,

Thanks for your suggestions. I will try using just the IndexSearcher(String...) constructor and see if that makes a difference in the problem. I can confirm that I am doing a proper close() and that I'm checking for exceptions. Again, the problem is not with the search function, but with the command-line indexer. It is not run at startup, but on demand when the index needs to be recreated.

Thanks,
Fred

At 08:50 AM 9/20/2004, you wrote: [quoted message trimmed]
Re[2]: indexes won't close on windows
Hello Fred, When you recreate an index from scratch (with the last IndexWriter constructor argument set to true), all IndexReaders must be closed, because IndexWriter tries to delete all files in the directory where your index is being created. If you have any open IndexReader at that time, then Windows locks some of the files used by the IndexReader, preventing them from being changed by another process. Thus IndexWriter's constructor is unable to delete these files and throws an IOException. Max Monday, September 20, 2004, 4:40:00 PM, you wrote: FT Hi Sergiu, FT Thanks for your suggestions. I will try using just the IndexSearcher(String...) FT and see if that makes a difference in the problem. I can confirm that FT I am doing a proper close() and that I'm checking for exceptions. Again, FT the problem is not with the search function, but with the command-line FT indexer. It is not run at startup, but on demand when the index needs FT to be recreated. FT Thanks, FT Fred
Re: indexes won't close on windows
Fred Toth wrote: Hi Sergiu, Thanks for your suggestions. I will try using just the IndexSearcher(String...) and see if that makes a difference in the problem. I can confirm that I am doing a proper close() and that I'm checking for exceptions. Again, the problem is not with the search function, but with the command-line indexer. It is not run at startup, but on demand when the index needs to be recreated. Thanks, Fred I remember there was one case where the searcher was used the way you use it, but without keeping a named reference to the index reader. That is not your case. Why do you get 'It is here that I get a failure, can't delete _b9.cfs'? Are you trying to delete the index folder sometimes, or ... why? Maybe one object is still using the index when you try to delete it. Do you write your errors to log files? It would be very helpful to have a stack trace. All the best, Sergiu At 08:50 AM 9/20/2004, you wrote: Hi Fred, That's right, there are many references to this kind of problem on the lucene-user list. These suggestions were already made, but I'll list them once again: 1. One way to use the IndexSearcher is to use your code, but I don't encourage users to do that: IndexReader reader = null; IndexSearcher searcher = null; reader = IndexReader.open(indexName); searcher = new IndexSearcher(reader); It's better to use the constructor that takes a String to create an IndexSearcher: IndexSearcher(String path). I even suggest that the path be obtained as: File indexFolder = new File(luceneIndex); IndexSearcher searcher = new IndexSearcher(indexFolder.toString()); 2. 
RE: indexes won't close on windows
Hi, I guess you have answered yourself. I can imagine that Tomcat was serving your servlet with constructed index searcher while your command line application wanted to recreate the index. Are you protected against this situation? Jiri. -Original Message- From: Fred Toth [mailto:[EMAIL PROTECTED] Sent: Monday, September 20, 2004 3:40 PM To: Lucene Users List Subject: Re: indexes won't close on windows Hi Sergiu, Thanks for your suggestions. I will try using just the IndexSearcher(String...) and see if that makes a difference in the problem. I can confirm that I am doing a proper close() and that I'm checking for exceptions. Again, the problem is not with the search function, but with the command-line indexer. It is not run at startup, but on demand when the index needs to be recreated. Thanks, Fred
Re: Running OutOfMemory while optimizing and searching
Doug Thank you for confirming this. ZJ Doug Cutting [EMAIL PROTECTED] wrote: John Z wrote: We have indexes of around 1 million docs and around 25 searchable fields. We noticed that without any searches performed on the indexes, on startup, the memory taken up by the searcher is roughly 7 times the .tii file size. The .tii file is read into memory as per the code. Our .tii files are around 8-10 MB in size and our startup memory foot print is around 60-70 MB. Then when we start doing our searches, the memory goes up, depending on the fields we search on. We are noticing that if we start searching on new fields, the memory kind of goes up. Doug, Your calculation below on what is taken up by the searcher, does it take into account the .tii file being read into memory or am I not making any sense ? 1 byte * Number of searchable fields in your index * Number of docs in your index plus 1k bytes * number of terms in query plus 1k bytes * number of phrase terms in query You make perfect sense. The formula above does not include the .tii. My mistake: I forgot that. By default, every 128th Term in the index is read into memory, to permit random access to terms. These are stored in the .tii file, compressed. So it is not surprising that they require 7x the size of the .tii file in memory. Doug
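Doug's rule of thumb can be turned into arithmetic. A sketch of the estimate (the 7x .tii factor and the ~1 KB per-term constants are the rough figures from this thread, not exact measurements):

```java
public class MemoryEstimate {
    // Rough searcher memory footprint per this thread's figures:
    // 1 byte per searchable field per doc (norms), ~7x the .tii file
    // size (every 128th term held in memory), ~1 KB per query term
    // and per phrase term.
    static long estimateBytes(int fields, int docs, long tiiBytes,
                              int queryTerms, int phraseTerms) {
        long norms = (long) fields * docs;
        long termIndex = 7L * tiiBytes;
        long query = 1024L * (queryTerms + phraseTerms);
        return norms + termIndex + query;
    }

    public static void main(String[] args) {
        // John Z's case: 25 fields, 1M docs, ~9 MB .tii file, no query yet
        long bytes = estimateBytes(25, 1000000, 9000000L, 0, 0);
        System.out.println(bytes / (1024 * 1024) + " MB");
    }
}
```

For the numbers above this lands in the same tens-of-megabytes range John Z observed at startup, with the term index dominating the norms.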
Too many boolean clauses
Hello There, Due to the fact that the [# TO #] range search works lexicographically, I am forced to build a rather large boolean query to get range data from my index. I have an ID field that contains about 500,000 unique ids. If I want to query all records with ids [1-2000], I build a boolean query containing all the numbers in the range, e.g. id:(1 2 3 ... 1999 2000). The problem with this is that I get the following error: org.apache.lucene.queryParser.ParseException: Too many boolean clauses Any ideas on how I might circumvent this issue by either finding a way to rewrite the query, or avoid the error? Thanks in advance, Shawn.
Re: Too many boolean clauses
On Monday 20 September 2004 18:27, Shawn Konopinsky wrote: Hello There, Due to the fact that the [# TO #] range search works lexicographically, I am forced to build a rather large boolean query to get range data from my index. I have an ID field that contains about 500,000 unique ids. If I want to query all records with ids [1-2000], I build a boolean query containing all the numbers in the range, e.g. id:(1 2 3 ... 1999 2000). The problem with this is that I get the following error: org.apache.lucene.queryParser.ParseException: Too many boolean clauses Any ideas on how I might circumvent this issue by either finding a way to rewrite the query, or avoid the error? You can use this as an example: http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/java/org/apache/lucene/search/DateFilter.java (Just click view on the latest version to see the code) and iterate over your doc ids instead of over dates. This will give you a filter for the doc ids you want to query. Regards, Paul Elschot
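Paul's suggestion boils down to building a java.util.BitSet with one bit per document instead of one BooleanQuery clause per id. A minimal sketch of that construction, with the Lucene index walk replaced by an in-memory array (the real Filter's bits() method would iterate the reader's terms the way DateFilter does; docIds here is a hypothetical stand-in for the indexed id values):

```java
import java.util.BitSet;

public class IdRangeFilterSketch {
    // docIds[doc] holds the indexed id value of document number doc.
    static BitSet bitsForRange(int[] docIds, int lo, int hi) {
        BitSet bits = new BitSet(docIds.length);
        for (int doc = 0; doc < docIds.length; doc++) {
            // Instead of ~2000 boolean clauses, flip one bit per match.
            if (docIds[doc] >= lo && docIds[doc] <= hi) {
                bits.set(doc);
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        int[] ids = {5, 1500, 2500, 2000};
        System.out.println(bitsForRange(ids, 1, 2000)); // {0, 1, 3}
    }
}
```

Because the BitSet has a fixed bit per document, the cost no longer grows with the number of terms in the range, which is exactly what sidesteps the "Too many boolean clauses" limit.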
Similarity scores: tf(), lengthNorm(), sumOfSquaredWeights().
After last week's discussion on idf() of the similarity score computation I looked into the score computation a bit deeper. In the DefaultSimilarity tf() is the sqrt() and lengthNorm() is the inverse of sqrt(). That means that the factor (docTf * docNorm) actually implements the square root of the density of the query term in the document field (ignoring the encoding and decoding of the norm). Summing these weighted square roots resembles a Salton OR p-Norm for p = 1/2, except that Salton defined the p-Norms for p >= 1, and the result is more like an AND p-Norm because it depends mostly on the minimum argument. The p-Norm also requires that the sum is taken to the power 1/p, but this is not necessary as it would not change the ranking. I looked around for p-Norms with 0 < p < 1, but I didn't find anything. Is there really nothing about this? A good discussion is here: http://elvis.slis.indiana.edu/irpub/SIGIR/1994/cite19.htm I would guess that since the sqrt() has an infinite derivative at zero, it might well be that this OR p-Norm for p = 1/2 behaves much like a rather high power AND p-Norm. The basic summing form of the OR p-Norm also allows a very easy implementation by just summing the weighted square roots; an AND p-Norm for p > 1 would have needed some more calculations. Is this perhaps one of the reasons for using a power p < 1? Taking this a bit further, I also wonder about the name of sumOfSquaredWeights() in the Weight interface. Shouldn't that rather be sumOfPowerWeights() and by default implement a sum of square roots? This would allow a more straightforward comprehension of the term weights as directly weighing the term densities. Section 5 of the reference above has the full weighted p-Norm formulas. The OR p-Norm there is very close to the Lucene formula without coord(). Regards, Paul Elschot On Tuesday 14 September 2004 23:49, Doug Cutting wrote: Your analysis sounds correct. At base, a weight is a normalized tf*idf. 
So a document weight is: docTf * idf * docNorm and a query weight is: queryTf * idf * queryNorm where queryTf is always one. So the product of these is (docTf * idf * docNorm) * (idf * queryNorm), which indeed contains idf twice. I think the best documentation fix would be to add another idf(t) clause at the end of the formula, next to queryNorm(q), so this is clear. Does that sound right to you? Doug Ken McCracken wrote: Hi, I was looking through the score computation when running search, and think there may be a discrepancy between what is _documented_ in the org.apache.lucene.search.Similarity class overview Javadocs, and what actually occurs in the code. I believe the problem is only with the documentation. I'm pretty sure that there should be an idf^2 in the sum. Look at org.apache.lucene.search.TermQuery, the inner class TermWeight. You can see that first sumOfSquaredWeights() is called, followed by normalize(), during search. Further, the resulting value, stored in the field 'value', is set as the weightValue on the TermScorer. If we look at what happens to TermWeight, sumOfSquaredWeights() sets queryWeight to idf * boost. During normalize(), queryWeight is multiplied by the query norm, and value is set to queryWeight * idf == idf * boost * query norm * idf == idf^2 * boost * query norm. This becomes the weightValue in the TermScorer that is then used to multiply with the appropriate tf, etc., values. The remaining terms in the Similarity description are properly appended. I also see that the queryNorm effectively cancels out (dimensionally, since it is 1/square root of a sum of squares of idfs) one of the idfs, so the formula still ends up being roughly a TF-IDF formula. But the idf^2 should still be there, along with the expansion of queryNorm. Am I mistaken, or is the documentation off? 
Thanks for your help, -Ken
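Written out, the product Ken traces (using the names from the code, with queryTf = 1) is:

```latex
\mathrm{weightValue}
  = \underbrace{(\mathrm{idf}(t)\cdot\mathrm{boost})}_{\text{sumOfSquaredWeights()}}
    \cdot\;\mathrm{queryNorm}(q)\cdot\mathrm{idf}(t)
  = \mathrm{idf}(t)^{2}\cdot\mathrm{boost}\cdot\mathrm{queryNorm}(q)
```

which is why the documented formula needs the extra idf(t) factor Doug proposes, even though queryNorm(q), being 1/sqrt of a sum of squared idfs, dimensionally cancels one of them.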
RE: Too many boolean clauses
Hey Paul, Thanks for the quick reply. Excuse my ignorance, but what do I do with the generated BitSet? Also - we are using a pooling feature which contains a pool of IndexSearchers that are used and tossed back each time we need to search. I'd hate to have to work around this and open up an IndexReader for this particular search, where all other searches use the pool. Suggestions? Thanks, Shawn.
RE: indexes won't close on windows - solved
All, Many thanks for your help and comments. I found a bug in my code where, in obscure circumstances, the indexes were being left open. Now fixed, thanks to everyone's help. Fred
Re: Too many boolean clauses
On Monday 20 September 2004 20:54, Shawn Konopinsky wrote: Hey Paul, Thanks for the quick reply. Excuse my ignorance, but what do I do with the generated BitSet? You can return it in the bits() method of the object implementing org.apache.lucene.search.Filter (http://jakarta.apache.org/lucene/docs/api/index.html). Then pass the Filter to IndexSearcher.search() with the query. Regards, Paul
Re: Too many boolean clauses
On Monday 20 September 2004 20:54, Shawn Konopinsky wrote: Hey Paul, ... Also - we are using a pooling feature which contains a pool of IndexSearchers that are used and tossed back each time we need to search. I'd hate to have to work around this and open up an IndexReader for this particular search, where all other searches use the pool. Suggestions? You could use a map from the IndexSearcher back to the IndexReader that was used to create it. (It's a bit of a waste because the IndexSearcher has a reader attribute internally.) Regards, Paul Elschot
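Paul's map workaround can be sketched with stdlib types only (the searcher and reader appear as plain Object placeholders since the Lucene classes aren't needed to show the bookkeeping; the IdentityHashMap is my assumption, chosen so distinct searcher instances never collide on equals()):

```java
import java.util.IdentityHashMap;
import java.util.Map;

public class SearcherReaderMap {
    // searcher -> the reader it was created from
    private final Map<Object, Object> readerOf = new IdentityHashMap<>();

    // Record the pairing when the pool constructs a searcher.
    void register(Object searcher, Object reader) {
        readerOf.put(searcher, reader);
    }

    // Recover the reader when a pooled searcher needs to build a Filter.
    Object readerFor(Object searcher) {
        return readerOf.get(searcher);
    }
}
```

The pool populates the map at checkout time, so filter construction never has to open a second IndexReader against the same index.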
Highlighting PDF file after the search
Hello, I can successfully index and search PDF documents; however, I am not able to highlight the searched text in my original PDF file (i.e. like dtSearch highlights on the original file). I took a look at the highlighter in the sandbox, compiled it and have it ready. I am wondering if this highlighter is for highlighting indexed documents or whether it can be used for PDF files as is! Please enlighten! Thanks, Vijay Balasubramanian DPRA Inc.,
Re: Highlighting PDF file after the search
[EMAIL PROTECTED] wrote: Hello, I can successfully index and search the PDF documents, however i am not able to highlight the searched text in my original PDF file (ie: like dtSearch highlights on original file) I took a look at the highlighter in sandbox, compiled it and have it ready. I am wondering if this highlighter is for highlighting indexed documents or can it be used for PDF Files as is ! Please enlighten ! I did this a few weeks ago. There are two ways, and they both revolve around the same thing: you need the tokenized PDF text available. [a] Store the tokenized PDF text in the index, or in some other file on disk, i.e. a cache (but 'cache' is a misleading term, as you can't have a cache miss unless you can do [b]). [b] Tokenize it on the fly when you call getBestFragments() - the 1st arg, the TokenStream, should be one that takes a PDF file as input and tokenizes it. http://www.searchmorph.com/pub/jakarta-lucene-sandbox/contributions/highlighter/build/docs/api/org/apache/lucene/search/highlight/Highlighter.html#getBestFragments(org.apache.lucene.analysis.TokenStream,%20java.lang.String,%20int,%20java.lang.String)
Re: Highlighting PDF file after the search
Thanks David. I'll give that a shot and let you know. Vijay Balasubramanian DPRA Inc., 214 665 7503
RE: Highlighting PDF file after the search
From: [EMAIL PROTECTED] I can successfully index and search the PDF documents, however i am not able to highlight the searched text in my original PDF file (ie: like dtSearch highlights on original file) I took a look at the highlighter in sandbox, compiled it and have it ready. I am wondering if this highlighter is for highlighting indexed documents or can it be used for PDF Files as is ! Please enlighten ! The highlighter code in the sandbox can facilitate highlighting of text *extracted* from the PDF; however, it does nothing to help you highlight search terms *inside* of the PDF. For that you will need some sort of tool that can modify the PDF on the fly as the user views it. I know of no quick and dirty tool that allows you to do this, though there are quite a few projects and products which allow you to manipulate PDF files and which can likely be used to obtain the behavior you are looking for (with some effort on your part). Regards, Bruce Ritchie
Problems with Lucene + BDB (Berkeley DB) integration
Hi everyone, I am trying to use the Lucene + BDB integration from the sandbox (http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/db/). I installed C Berkeley DB 4.2.52 and I have the Lucene jar file. I have an example program that indexes 4 small text files in a directory (it's very similar to the IndexFiles.java in the Lucene demo, except that it uses BDB + Lucene). The problem I have is that executing the indexing program generates different results each time I run it. For example: If I start with an empty index, run the indexing program and then query the index, I get the correct results; then I delete the index to start from scratch again, perform the same sequence, and I get no results. (?) What puzzles me is the non-deterministic results... the same execution sequence generates two different results. I then wrote a program to dump the index and I found out that the list of files that end up in the index is different every time I index those 4 files. For example: 1st run: contents of directory: _4.f2, _4.f3, _4.cfs, _4.fdx, _4.fnm, _4.frq, _4.prx, _4.tii, segments, deletable. (9 files) 2nd run: contents of directory: 0:_4.f1, _4.cfs, _4.fdt, _4.fdx, _4.fnm, _4.frq, _4.prx, _4.tii, _4.tis, segments, deletable. (11 files) Does anyone have any idea why this is happening? Has anyone been able to use the BDB + Lucene integration with no problems? I'd appreciate any help or pointers. Thanks! Xtian
Re: Problems with Lucene + BDB (Berkeley DB) integration
I used BDB + Lucene successfully with the Lucene 1.3 distribution, but it broke in my application with the 1.4 distribution. The 1.4 dist uses a different file format by default, the compound file format (the .cfs files), so maybe that is the source of the issues. good luck, andy g On Mon, 20 Sep 2004 19:36:51 -0300, Christian Rodriguez [EMAIL PROTECTED] wrote: Hi everyone, I am trying to use the Lucene + BDB integration from the sandbox [snip]
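One hedged experiment to test that theory: switch the compound file format off and see whether the 1.3 behavior returns. `IndexWriter.setUseCompoundFile(boolean)` is available in 1.4; the `dir` argument below would be the DbDirectory built by the sandbox contribution (construction details omitted, since they depend on your BDB setup):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;

public class NoCompound {
    // 'dir' would be the sandbox DbDirectory wrapping your Berkeley DB store.
    public static IndexWriter openWriter(Directory dir) throws Exception {
        // 'true' recreates the index from scratch, matching the poster's test setup.
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
        // Revert to the 1.3-style multi-file segment format for this test.
        writer.setUseCompoundFile(false);
        return writer;
    }
}
```

If the index becomes deterministic with compound files disabled, that would point at how DbDirectory handles the .cfs merge step rather than at BDB itself.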
Re: Use of SortComparator.getComparable() ?
Dear all, I've recently been implementing sort logic that leverages an external index; however, I'm confused by newComparator() and getComparable() in SortComparator. It seems natural to me that the call chain is IndexSearcher -> FieldSortedHitQueue -> factory.newComparator(). But what's the use of getComparable() if newComparator() is doing the job? Any usage scenario? Thanks Tea
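As I read the SortComparator source, newComparator() is a template method: the base class implements it for you, building a ScoreDocComparator that looks up each document's term text for the field and compares the Comparable objects your getComparable() returns. So getComparable() is the one hook you normally override. A minimal sketch (the class name and the numeric-term interpretation are made up for illustration):

```java
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortComparator;
import org.apache.lucene.search.SortField;

// Sorts documents by the numeric value encoded in the field's term text,
// e.g. a term "0042" compares as the Integer 42 rather than as a string.
public class NumericTermComparator extends SortComparator {
    protected Comparable getComparable(String termtext) {
        // Called once per term by the inherited newComparator() machinery;
        // the resulting Comparables are what actually get compared at sort time.
        return Integer.valueOf(termtext);
    }
}
```

Usage would then be something like `new Sort(new SortField("myfield", new NumericTermComparator()))`; you only hand the factory to SortField and never call newComparator() yourself.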
WildCardQuery
Is there a limitation in Lucene when it comes to wildcard search? Is it a problem if we use fewer than 3 characters along with a wildcard (*)? I get a TooManyClauses error if I try 45*, *34, *3, etc. It doesn't happen if '?' is used instead of '*'. The intriguing thing is that it is not consistent: 00* doesn't fail. Am I missing something? Robin
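For reference: a WildcardQuery is rewritten into a BooleanQuery with one clause per indexed term matching the pattern, and BooleanQuery throws TooManyClauses once the clause count passes its limit (1024 by default). That explains the apparent inconsistency — failure depends on how many terms in *your* index match, not on the pattern itself, so 00* succeeds simply because few terms start with 00, and '?' expands to far fewer terms than '*' since it matches exactly one character. One workaround, at the cost of memory during query rewriting, is raising the limit:

```java
import org.apache.lucene.search.BooleanQuery;

public class RaiseClauseLimit {
    public static void main(String[] args) {
        // Default is 1024; wildcard and range expansion beyond that throws
        // BooleanQuery.TooManyClauses. The setting is global (static).
        BooleanQuery.setMaxClauseCount(8192);
        System.out.println(BooleanQuery.getMaxClauseCount());
    }
}
```

A short prefix like 45* on a large index can still blow past any fixed limit, so for very broad patterns a filter-based approach is generally the safer design than raising the clause count indefinitely.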