Re: BooleanQuery - Too Many Clauses on date range.
BTW, what's wrong with the DateFilter solution I mentioned earlier? I've used it before (before lucene-1.4, though) without memory problems, so I always assumed that it avoided the allocation problems with prefix queries.

sv

On Mon, 4 Oct 2004, Chris Fraschetti wrote:

Surely some folks out there have used Lucene on a large scale and have had to compensate for this somehow; any other solutions? Morus, thank you very much for your input, and I am looking into your solution; I'm just putting my feelers out there once more. The Lucene API is very limited in its descriptions of its components. Short of digging into the code, is there a good doc somewhere that explains the workings of Lucene?

On Mon, 4 Oct 2004 01:57:06 -0700, Chris Fraschetti [EMAIL PROTECTED] wrote:

So before I spend a significant amount of time digging into the Lucene code: how does your experience with Lucene shed light on my situation? Our current index is pretty huge, and with each increase in size I've experienced a problem like this. Without taking up too much of your time (because obviously this is my task), I thought I'd ask whether you'd had any experience with this boolean clause nonsense. Of course it can be overcome, but if you know a quick hack, awesome; otherwise, no big deal, but off to work I go :) -Fraschetti

-- Forwarded message --
From: Morus Walter [EMAIL PROTECTED]
Date: Mon, 4 Oct 2004 09:01:50 +0200
Subject: Re: BooleanQuery - Too Many Clauses on date range.
To: Lucene Users List [EMAIL PROTECTED], Chris Fraschetti [EMAIL PROTECTED]

Chris Fraschetti writes:

So I decided to move my epoch date to the 20040608 date, which fixed my boolean query problem for my current data size (approx 600,000), but now as soon as I do a query like a* I get the boolean error again. Google obviously can handle this query, and I'm pretty sure Lucene can handle it. Any ideas? With or without a date range specified, I still get the TooManyClauses error. I tried cranking the max clauses up to Integer.MAX_VALUE, but Java gave me an out-of-memory error. Is this because the boolean search tried to allocate that many clauses by default, or because my query actually needed that many clauses?

Boolean search allocates clauses for all tokens having the prefix or matching the wildcard expression.

Why does it work on small indexes but not large ones? Because small indexes have fewer tokens starting with "a".

Is there any way to have the parser create as many clauses as it can and then search with what it has, without recompiling the source? You need to create your own versions of WildcardQuery and PrefixQuery that take a maximum term number and ignore further clauses. And you need a variant of the query parser that uses these queries. This can be done, even without recompiling Lucene, but you will have to do some programming at the level of Lucene queries. It shouldn't be hard, since you can use the sources as a starting point. I guess this does not exist because the Lucene developers decided to prefer a query error over incomplete results.

Morus

-- ___ Chris Fraschetti, Student CompSci System Admin University of San Francisco e [EMAIL PROTECTED] | http://meteora.cs.usfca.edu

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
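Morus's suggestion, enumerating the matching terms but stopping at a cap instead of throwing, can be sketched independently of the Lucene API. The class and method names below are invented for illustration; a real implementation would walk Lucene's TermEnum inside a custom PrefixQuery/WildcardQuery rather than a plain list:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CappedPrefixExpansion {
    // Collect at most maxTerms terms that start with the given prefix.
    // The capping logic is the whole point: stop early and search with
    // what you have, rather than failing with TooManyClauses.
    static List<String> expandPrefix(List<String> sortedTerms, String prefix, int maxTerms) {
        List<String> clauses = new ArrayList<>();
        for (String term : sortedTerms) {
            if (term.startsWith(prefix)) {
                if (clauses.size() >= maxTerms) {
                    break; // ignore further clauses
                }
                clauses.add(term);
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("aardvark", "abacus", "able", "apple", "banana");
        // Four terms start with "a", but only the first two become clauses.
        System.out.println(expandPrefix(terms, "a", 2)); // prints [aardvark, abacus]
    }
}
```

The trade-off Morus mentions still applies: the capped query silently returns incomplete results instead of an error.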
Re: BooleanQuery - Too Many Clauses on date range.
Ok, got it; one small comment, though. For large wildcard queries, please note that Google does not support wildcards: search for hell*, and there will be no matches for hello. Is there a reason why you wish to allow such large queries? We might be able to find alternative ways of helping you out. No one will use a query like a*; if someone does, the results will be completely meaningless (many false positives for the user). However, a query like program* might be interesting to a user. The problem with hacking term expansion is that the rules of this expansion might be hard to define (maybe one should use the first terms, the most frequent, or even the least frequent, depending on your app).

sv

On Mon, 4 Oct 2004, Chris Fraschetti wrote:

The date portion of my code works great now, no problems there, so let me thank you now for your date filter solution. But my current problem is a standalone a* query giving me the too-many-clauses exception.
Re: BooleanQuery - Too Many Clauses on date range.
I've used a simple message saying that the user's request was too vague and that they should modify it. I haven't had many complaints about this, especially when I explained why to a client: if one user of many does a*, the whole system will grind to a halt, as that one request will use up all of the available memory (wildcards aren't very scalable...). Here is an example of a working system: http://theserverside.com/search/search.tss I don't know if many people complain that a* returns no results, but a request for javap* returns javapro, javaplus, javapolis...

HTH, sv

On Mon, 4 Oct 2004, Chris Fraschetti wrote:

Absolutely, limiting the user's query is no problem here. I've currently implemented the Lucene JavaScript to catch a lot of user queries that could cause issues: blank queries, ? or * at the beginning of a query, etc. But I couldn't think of a way to prevent the user from doing a* while still allowing comment* for users wanting "comments" or "commentary". Any suggestions would be warmly welcomed.
Re: WildCardQuery
On Fri, 1 Oct 2004, Robinson Raju wrote:

The analyzer is StandardAnalyzer; I use MultiFieldQueryParser to parse. The flow is this: I have indexed a database view. Now I need to search against a few columns. I take in the search criteria and search field, construct a wildcard query, and add it to a boolean query:

WildcardQuery wQuery = new WildcardQuery(new Term(searchFields[0], searchString));

What is the value of searchString? Is it a word? QueryParser syntax is not applied here. What does ab* return?

booleanQuery.add(wQuery, true, false);
Query queryFilter = MultiFieldQueryParser.parse(filterString, filterFields, flags, analyzer);
hits = parallelMultiSearcher.search(booleanQuery, queryFilter);

When I don't use wildcards, the query is taken as +((ITM_SHRT_DSC:natal ITM_SHRT_DSC:tylenol) (ITM_LONG_DSC:natal ITM_LONG_DSC:tylenol)). But when a wildcard is used, it is taken as +ITM_SHRT_DSC:nat* tylenol +ITM_LONG_DSC:nat* Tylenol.

Are the ITM_XXX fields tokenized?

sv

The first query returns around 300 records, the second 0. Any help would be appreciated. Thanks, Robin
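One likely cause of the zero hits above: StandardAnalyzer lowercases tokens at index time, but wildcard terms bypass the analyzer, so a pattern containing uppercase (Tylenol, Nat*) can never match the lowercased indexed token. A minimal sketch of normalizing the pattern before building the term; the helper name is invented for illustration:

```java
import java.util.Locale;

public class WildcardNormalizer {
    // Lowercase the user's wildcard pattern so it can match tokens that
    // the analyzer lowercased at index time; the * and ? wildcard
    // characters pass through toLowerCase unchanged.
    static String normalize(String pattern) {
        return pattern.toLowerCase(Locale.ENGLISH);
    }

    public static void main(String[] args) {
        System.out.println(normalize("Nat*"));    // prints nat*
        System.out.println(normalize("Tylenol")); // prints tylenol
    }
}
```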
RE: WildCardQuery
Can you be a little more precise about how you process your documents? 1) What's your analyser? SimpleAnalyzer? 2) How do you parse the query? The out-of-the-box QueryParser?

"Can we not enter a space, or do an OR search with two words, one of which has a wildcard?" Simple answer: yes, you can. Complicated answer: words are delimited by your tokeniser, which is included in your analyser (hence my question above). The asterisk syntax comes from using a query parser that transforms the query into a PrefixQuery object.

sv

On Fri, 1 Oct 2004, Robinson Raju wrote:

Hi, would there be a problem if one enters a space while using wildcards? Say I search for 'abc': I get 100 hits as results. 'man' gives 200. 'abc man' gives 300. But 'ab* man', 'abc ma*', 'ab* ma*', 'ab* OR ma*': all of these return 0 results. Can we not enter a space, or do an OR search with two words, one of which has a wildcard?

Regards, Robin
Re: BooleanQuery - Too Many Clauses on date range.
How about a DateFilter? http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/DateFilter.html I don't believe it has the same restrictions as boolean queries.

HTH, sv

On Thu, 30 Sep 2004, Chris Fraschetti wrote:

I recently read, in regards to my problem, that date_field:[0820483200 TO 110448] is evaluated into a series of boolean queries, which has a cap of 1024. Considering my documents will have dates spanning many years, and I need the granularity of by-day searching, are there any recommendations on how to make this work? Currently, with the query +content_field:sometext +date_field:[0820483200 TO 110448], I get the following exception: org.apache.lucene.search.BooleanQuery$TooManyClauses. Any suggestions on how I can keep by-day granularity without limiting my search results? Are there any date formats that I can change those numbers to that would allow me to complete the search (i.e. Feb 15, 2004)? Can Lucene's range query do a proper search on formatted dates? Is there a combination of RangeQuery and Query/MultiTermQuery that I can use? Your help is greatly appreciated.
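One common approach to by-day granularity, independent of DateFilter, is to index the date as a fixed-width yyyyMMdd keyword, so that string order matches chronological order and a day-granularity range never needs more than one term per day. A minimal sketch, assuming the field is populated from an epoch timestamp:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class DayKey {
    // Format a timestamp as a fixed-width, lexicographically sortable
    // day key; every document from the same day gets the same key.
    static String dayKey(long epochMillis) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMdd");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(new Date(epochMillis));
    }

    public static void main(String[] args) {
        // "20040101" < "20041004" as strings, so a range query over this
        // field compares correctly without numeric parsing.
        System.out.println(dayKey(0L)); // prints 19700101
    }
}
```

Both ends of the range query must then use the same yyyyMMdd format.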
Re: WordListLoader's whereabouts
Hi Tate,

From the commit: http://www.mail-archive.com/[EMAIL PROTECTED]/msg06510.html I'd say you can use the German WordlistLoader (renaming it, or using a nightly CVS version of the refactored class). I think there might be a versioning issue here, as http://wiki.apache.org/jakarta-lucene/Lucene2Whiteboard mentions: "DONE: Move language-specific analyzers into separate downloads. Also move analysis/de/WordlistLoader.java one level upwards, as it's not specific to German at all." That should only be applicable to Lucene 1.9. The last version comment for BrazilianAnalyzer reads: "move the word list loader from analysis.de to analysis, as it is not specific to German at all; update the references to it".

HTH, sv

On Mon, 27 Sep 2004, Tate Avery wrote:

Hello, I am trying to compile the analyzers from the Lucene sandbox contributions. Many of them import org.apache.lucene.analysis.WordlistLoader, which is not currently in my classpath. Does anyone know where I can find this class? It does not appear to be in Lucene 1.4, so I am assuming it is another contribution, perhaps? Any help in tracking it down would be appreciated. Also, some of the analyzers appear to have their own copy of this class (i.e. org.apache.lucene.analysis.nl.WordlistLoader). Could I just relocate that one to the shared package, perhaps? Thanks, Tate
Re: indexing size
Hi Niraj,

I'd rather respond to the list, as others may be interested in your questions, and since I don't consider myself a guru, I appreciate being corrected. For a title, I'd say yes, use the Field.Text(String name, String value) constructor, not the ones that take a Reader, as those do not store the value. You want the field to be: 1) tokenised, so as to have its fragments saved for searching, not only the totality of the text; 2) indexed, so as to make it searchable; 3) stored, so as to make the field retrievable from the index.

hth, sv

p.s. my name is Stephane; it's been a while since I left Oz, and I haven't been called James since.

On Wed, 1 Sep 2004, Niraj Alok wrote:

Hi James, since this is a minor issue I am not posting it to the Lucene list. Let's say I have one field, title, which has a value of "George Bush". I would need to search on that title and also retrieve its value. So you are saying that I should have it as Field.Text? Also, if I need to just search on that title but want to retrieve the value of another field, content, then title should be unstored while content should be stored?

Regards, Niraj
Re: indexing size
On Wed, 1 Sep 2004, Niraj Alok wrote:

I was also thinking along the same lines. Actually, the original code was written by someone else who has left, so I have to own it. In almost all places it is Field.Text, and in a few places it's Field.UnIndexed. I looked at the javadocs and found that there is Field.UnStored also. The problem is I am not too sure which one to change to what. It would be really enlightening if you could point out the differences between those three and what I would need to change in my search code. If I make some of them Field.UnStored, I can see from the javadocs that they will be indexed and tokenized but not stored. If a field is not stored, how can I use it while searching? Basically, what is meant by indexed and stored, indexed and not stored, and not indexed and stored?

If all you need is to search a field, you do not need to store it. If it is not stored, it can still be tokenised and analysed by Lucene; it will then be stored only as a set of tokens, not as a whole. You can thus use it for fields that you never need to retrieve from the index. For example, "the quick brown fox jumped over the lazy dog." will be stored in Lucene only as tokens, not as a whole. Using a whitespace analyser with the stopword list {the}, you will have these tokens in Lucene: quick brown fox jumped over lazy dog. You will NOT be able to retrieve the original text, but you will be able to search it.

HTH, sv

Regards, Niraj

- Original Message - From: petite_abeille [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Tuesday, August 31, 2004 8:57 PM Subject: Re: indexing size

On Aug 31, 2004, at 17:17, Otis Gospodnetic wrote: You also have a large number of fields, and it looks like a lot (all?) of them are stored and indexed. That's what that large .fdt file indicated. That file is 206 MB in size. Try using Field.UnStored() to avoid storing all that data in your indices, as it's usually not necessary. PA.
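The whitespace-plus-stopword example above can be sketched as plain code. This is only an illustration of what survives of an unstored field, not the actual Lucene analyzer chain:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SimpleTokenizer {
    // Split on whitespace, lowercase, strip punctuation, drop stopwords:
    // roughly what an analyser produces, and all that remains searchable
    // of a field indexed as unstored.
    static List<String> tokens(String text, Set<String> stopwords) {
        List<String> out = new ArrayList<>();
        for (String raw : text.split("\\s+")) {
            String tok = raw.toLowerCase().replaceAll("[^a-z]", "");
            if (!tok.isEmpty() && !stopwords.contains(tok)) {
                out.add(tok);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("the"));
        System.out.println(tokens("The quick brown fox jumped over the lazy dog.", stop));
        // prints [quick, brown, fox, jumped, over, lazy, dog]
    }
}
```

The original sentence cannot be reconstructed from this token list, which is exactly why an unstored field is searchable but not retrievable.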
RE: Range query problem
A description of how to search numerical fields is available on the wiki: http://wiki.apache.org/jakarta-lucene/SearchNumericalFields

sv

On Thu, 26 Aug 2004, Alex Kiselevski wrote:

Thanks, I'll try it.

-Original Message- From: Daniel Naber [mailto:[EMAIL PROTECTED]] Sent: Thursday, August 26, 2004 12:59 PM To: Lucene Users List Subject: Re: Range query problem

On Thursday 26 August 2004 11:02, Alex Kiselevski wrote: I have a strange problem with a range query, PERIOD:[1 TO 9]. It works only if the second parameter is equal to or less than 9; if it's greater than 9, it finds no documents.

You have to store your numbers so that they will appear in the right order when sorted lexicographically, e.g. save 1 as 01 if you save numbers up to 99, or as 0001 if you save numbers up to ... You also have to use this format for searching, I think.

Regards, Daniel -- http://www.danielnaber.de

The information contained in this message is proprietary of Amdocs, protected from disclosure, and may be privileged. The information is intended to be conveyed only to the designated recipient(s) of the message. If the reader of this message is not the intended recipient, you are hereby notified that any dissemination, use, distribution or copying of this communication is strictly prohibited and may be unlawful. If you have received this communication in error, please notify us immediately by replying to the message and deleting it from your computer. Thank you.
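Daniel's padding advice comes down to making string order agree with numeric order. A minimal sketch; the width of 4 is an assumption, and should be chosen from the largest value you index:

```java
public class PadNumber {
    // Left-pad a non-negative number to a fixed width so that
    // lexicographic comparison matches numeric comparison.
    static String pad(int n, int width) {
        return String.format("%0" + width + "d", n);
    }

    public static void main(String[] args) {
        // Unpadded: "10" sorts before "9", so PERIOD:[1 TO 10] misbehaves.
        System.out.println("9".compareTo("10") > 0);              // prints true
        // Padded: ordering is restored; both the indexed values and both
        // ends of the range query must use the same width.
        System.out.println(pad(9, 4));                            // prints 0009
        System.out.println(pad(9, 4).compareTo(pad(10, 4)) < 0);  // prints true
    }
}
```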
Re: what is wrong with query
You'll have to give us more information than that. What is the problem you are seeing? I'll assume that you get no results. Tell us about the structure of your documents and how you index each field. Concerning your syntax: if you are using the query parser distributed with Lucene, you don't need the + before name, nor the + before university, as they will be added by the parser.

sv

On Wed, 25 Aug 2004, Alex Kiselevski wrote:

Hi, please tell me what is wrong with this query: author:( +name AND full name~) AND book:( +university)

Alex Kiselevsky, Speech Technology, Tel: 972-9-776-43-46, RD, Amdocs - Israel, Mobile: 972-53-63 50 38, mailto:[EMAIL PROTECTED]
RE: what is wrong with query
From http://jakarta.apache.org/lucene/docs/queryparsersyntax.html:

"Fuzzy Searches: Lucene supports fuzzy searches based on the Levenshtein Distance, or Edit Distance, algorithm. To do a fuzzy search use the tilde, ~, symbol at the end of a Single word Term."

I haven't used fuzzy searches, but this seems to indicate that the tilde can only be used with single-word terms. The query parser might have been written to support only that (the exception below indicates as much).

HTH, sv

On Wed, 25 Aug 2004, Alex Kiselevski wrote:

I use QueryParser, and I got an exception:

org.apache.lucene.queryParser.ParseException: Encountered "~" at line 1, column 44.
Was expecting one of: AND ... OR ... NOT ... + ... - ... ( ... ) ... ^ ... QUOTED ... TERM ... SLOP ... PREFIXTERM ... WILDTERM ... [ ... { ... NUMBER ...
    at org.apache.lucene.queryParser.QueryParser.generateParseException(QueryParser.java:1045)
    at org.apache.lucene.queryParser.QueryParser.jj_consume_token(QueryParser.java:925)
    at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:562)
    at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:500)
    at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:108)
    at com.stp.corr.cv.search.CVSearcher.getMatchedResults(CVSearcher.java:89)
    at com.stp.test.CVTest.main(CVTest.java:223)
Re: Time to index documents
I don't think that the demo parser is meant as a production system component. You can look at Tidy or NekoHTML; they clean up your HTML and are probably optimised.

sv

On Wed, 25 Aug 2004, Hetan Shah wrote:

Hello all, is there a way to reduce the indexing time when the indexer is indexing about 30,000+ files? It is roughly taking around 6-7 hours. I am using the IndexHTML class to create the index out of HTML files. Another issue I see is that every once in a while I get the following output on the screen:

adding ../31/1104852.html
Parse Aborted: Encountered "\" at line 7, column 1. Was expecting one of: ArgName ... = ... TagEnd ...

Any suggestions on preventing this from happening? Thanks in advance. -H
Re: Time to index documents
JGuru explanation: http://www.jguru.com/faq/view.jsp?EID=1074228 I have no sample code for NekoHTML; I think Nutch uses it, though. For Tidy, you can look at the ant contribution in the sandbox: http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/ant/src/main/org/apache/lucene/ant/HtmlDocument.java?rev=1.3&view=markup

HTH, sv

On Wed, 25 Aug 2004, Hetan Shah wrote:

Do you have any pointers to sample code for them? I would highly appreciate it. Thanks. -H
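For context, the cleanup that Tidy or NekoHTML performs amounts to recovering text and structure from messy markup before indexing. The toy stripper below only illustrates that step and is not a substitute for a real parser (it ignores comments, scripts, entities, and malformed tags, which are exactly what real cleaners repair):

```java
public class TagStripper {
    // Drop everything between < and > and collapse whitespace.
    // Real cleaners like Tidy also repair unbalanced tags, which
    // this sketch does not attempt.
    static String stripTags(String html) {
        StringBuilder out = new StringBuilder();
        boolean inTag = false;
        for (char c : html.toCharArray()) {
            if (c == '<') inTag = true;
            else if (c == '>') inTag = false;
            else if (!inTag) out.append(c);
        }
        return out.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>Hello <b>world</b></p>")); // prints Hello world
    }
}
```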
RE: Time to index documents
Hetan, If you are using a corpus with multiple editors, I suggest that you use a cleaner like Tidy, as there might be weird stuff in the HTML. sv On Thu, 26 Aug 2004, Karthik N S wrote: Hi Hetan, That's the major problem of non-standardized tags in the HTML documents you are indexing, resulting in the lag in time taken by the indexing process. You can tweak the HTMLParser.jj file within lucene.zip under '/demo/html' [you need some knowledge of JavaCC for this]. Karthik -----Original Message----- From: Hetan Shah Sent: Thursday, August 26, 2004 3:01 AM To: Lucene Users List Subject: Time to index documents
Re: Lucene PDF indexing
You need to add log4j to your classpath: http://logging.apache.org/log4j/docs/ sv On 24 Aug 2004, sivalingam T wrote: Hi, I have written a file for PDF indexing, as follows. This is my IndexPDF file:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.pdfbox.searchengine.lucene.LucenePDFDocument;
import java.io.File;
import java.util.Date;
import java.util.Arrays;

class IndexPDF {
    private static boolean deleting = false;  // true during deletion pass
    private static IndexReader reader;        // existing index
    private static IndexWriter writer;        // new index being built
    private static TermEnum uidIter;          // document id iterator

    public static void main(String[] argv) {
        try {
            String index = "index";
            boolean create = false;
            File root = null;
            String usage = "IndexHTML [-create] [-index index] root_directory";
            if (argv.length == 0) {
                System.err.println("Usage: " + usage);
                return;
            }
            for (int i = 0; i < argv.length; i++) {
                if (argv[i].equals("-index")) {         // parse -index option
                    index = argv[++i];
                } else if (argv[i].equals("-create")) { // parse -create option
                    create = true;
                } else if (i != argv.length - 1) {
                    System.err.println("Usage: " + usage);
                    return;
                } else
                    root = new File(argv[i]);
            }
            Date start = new Date();
            if (!create) {        // delete stale docs
                deleting = true;
                indexDocs(root, index, create);
            }
            writer = new IndexWriter(index, new StandardAnalyzer(), create);
            writer.maxFieldLength = 100;
            indexDocs(root, index, create);  // add new docs
            System.out.println("Optimizing index...");
            writer.optimize();
            writer.close();
            Date end = new Date();
            System.out.print(end.getTime() - start.getTime());
            System.out.println(" total milliseconds");
        } catch (Exception e) {
            System.out.println(" caught a " + e.getClass()
                + "\n with message: " + e.getMessage());
        }
    }

    /* Walk directory
       hierarchy in uid order, while keeping uid iterator from
       existing index in sync. Mismatches indicate one of: (a) old documents to
       be deleted; (b) unchanged documents, to be left alone; or (c) new
       documents, to be indexed. */
    private static void indexDocs(File file, String index, boolean create)
            throws Exception {
        if (!create) {  // incrementally update
            reader = IndexReader.open(index);             // open existing index
            uidIter = reader.terms(new Term("uid", ""));  // init uid iterator
            indexDocs(file);
            if (deleting) {  // delete rest of stale docs
                while (uidIter.term() != null && uidIter.term().field() == "uid") {
                    System.out.println("deleting "
                        + HTMLDocument.uid2url(uidIter.term().text()));
                    reader.delete(uidIter.term());
                    uidIter.next();
                }
                deleting = false;
            }
            uidIter.close();  // close uid iterator
            reader.close();   // close existing index
        } else  // don't have existing
            indexDocs(file);
    }

    private static void indexDocs(File file) throws Exception {
        if (file.isDirectory()) {         // if a directory
            String[] files = file.list(); // list its files
            Arrays.sort(files);           // sort the files
            for (int i = 0; i < files.length; i++) {  // recursively index them
                indexDocs(new File(file, files[i]));
            }
        }
        if ((file.getPath().endsWith(".pdf")) || (file.getPath().endsWith(".PDF"))) {
            System.out.println("Indexing PDF document: " + file);
            try {
                //Document doc = LucenePDFDocument.getDocument(file);
                writer.addDocument(LucenePDFDocument.getDocument(file));
            } catch (Exception e) {}
        }
    }
}

When I use the following command, the exception below is thrown. If anybody knows, please inform me.

C:\>java org.apache.lucene.demo.IndexPDF -create -index c:\lucene\pdf c:\pdfs\Words.pdf
Indexing PDF document: c:\pdfs\Words.pdf
Exception in thread main java.lang.NoClassDefFoundError: org/apache/log4j/Category
at org.pdfbox.searchengine.lucene.LucenePDFDocument.addContent(LucenePDFDocument.java:197) at
Re: pdfboxhelp
Your classpath should point to a directory that contains log4j.properties, not to the file directly; see below. sv

On Mon, 23 Aug 2004, Santosh wrote: Hi Natarajan, I kept log4j.properties in the classpath. My new classpath is C:\j2sdk1.4.1\lib\log4j.properties; (should be C:\j2sdk1.4.1\lib\) but there is no difference in the output.

- Original Message - From: Natarajan.T To: 'Lucene Users List' Sent: Monday, August 23, 2004 10:56 AM Subject: RE: pdfboxhelp Hi Santhosh, The attached file must be in your classpath. Natarajan.

-----Original Message----- From: Santosh Sent: Monday, August 23, 2004 10:51 AM To: Lucene Users List Subject: Fw: pdfboxhelp Hi Karthik, did you find any solution? Should I send the PDF to you?

- Original Message - From: Santosh To: Lucene Users List Sent: Monday, August 23, 2004 10:23 AM Subject: Re: pdfboxhelp Hi Karthik, I kept log4j in the classpath; I am sending the CLASSPATH variable:

CLASSPATH=.;..;C:\j2sdk1.4.1\lib;C:\j2sdk1.4.1\lib\jndi.jar;C:\j2sdk1.4.1\lib\webclient.jar;C:\j2sdk1.4.1\lib\mail.jar;C:\j2sdk1.4.1\lib\activation.jar;C:\j2sdk1.4.1\lib\xml-apis.jar;D:\JAVAPRO;C:\j2sdk1.4.1\jre\lib\ext\msbase.jar;C:\j2sdk1.4.1\lib\servlet.jar;E:\Program Files\Apache Tomcat 4.0\common\lib\servlet.jar;C:\Program Files\Altova\xmlspy\XMLSpyInterface.jar;C:\j2sdk1.4.1\lib\sax.jar;C:\j2sdk1.4.1\lib\dom.jar;C:\j2sdk1.4.1\lib\xalan.jar;C:\j2sdk1.4.1\lib\xercesImpl.jar;C:\j2sdk1.4.1\lib\xmlParserAPIs.jar;C:\j2sdk1.4.1\lib\parser.jar;C:\j2sdk1.4.1\lib\jaxp.jar;C:\j2sdk1.4.1\lib\xml.jar;C:\j2sdk1.4.1\lib\classes12.zip;C:\struts.jar;F:\apache-ant-1.6.1\lib\ant.jar;C:\j2sdk1.4.1\lib\PDFBox-0.6.6.jar;C:\j2sdk1.4.1\lib\lucene-20030909.jar;D:\setups\searchEngine\PDFBox-0.6.6\external\log4j.jar

Please check the error.

- Original Message - From: Karthik N S To: Lucene Users List Sent: Monday, August 23, 2004 10:26 AM Subject: RE: pdfboxhelp Hi Santosh, I think your PDF is using the log4j package; try to set the classpath for the log4j.jar path. [Is it just a WARNING or an ERROR you are getting? Send me your configuration and let me help you with it.] Karthik

-----Original Message----- From: Santosh Sent: Monday, August 23, 2004 10:11 AM To: Lucene Users List Cc: Ben Litchfield Subject: Re: pdfboxhelp Hi Karthik, I have downloaded PDFBox and kept the PDF jar file in the classpath, but when I type the following command at the command prompt I get this error:

D:\setups\searchEngine\PDFBox-0.6.6\src>java org.pdfbox.ExtractText C:\test.pdf C:\test.txt
log4j:WARN No appenders could be found for logger (org.pdfbox.pdfparser.PDFParser).
log4j:WARN Please initialize the log4j system properly.

Why am I getting this error? Please help.

- Original Message - From: Karthik N S To: Lucene Users List Sent: Monday, August 23, 2004 9:21 AM Subject: RE: pdfboxhelp Hi, To begin with, try to build the indexes offline [outside the Tomcat container]; on completing the indexes, feed your search with the real path of the offline indexed folder, start Tomcat, and then use the search. As you experiment you will get comfortable with the requirements of indexing/search. Karthik

-----Original Message----- From: Santosh Sent: Saturday, August 21, 2004 4:55 PM To: Lucene Users List Subject: Re: pdfboxhelp Yes, I did the same. I copied all the classes into the classes folder, but now when I build the index using IndexHTML, the PDFs are not added to the index; only text and HTML files are added.
What changes should I make to IndexHTML.java to build the index with PDFs? - Original Message - From: Karthik N S To: Lucene Users List Sent: Saturday, August 21, 2004 4:54 PM Subject: RE: pdfboxhelp Hi, If you are using the jar file with a web interface for JSP/servlet development, place the jar file in webapps/yourapplication/WEB-INF/lib and also correct the classpath for the present modification. 2) Create your own package, put all your java files in it, and copy them to /WEB-INF/classes/yourpackage. Then use the same. Karthik -----Original Message----- From: Santosh
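For reference, the log4j:WARN messages earlier in this thread just mean log4j found no configuration at all. A minimal log4j.properties placed in a directory on the classpath (the directory, not the file itself) along these lines silences them; the pattern shown is only an example:

```properties
# Minimal log4j configuration: send everything at INFO and above to the console
log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %-5p %c - %m%n
```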
Re: Lucene Search Applet
Hi Simon, Does this work? From the FSDirectory API: "If the system property 'disableLuceneLocks' has the String value of 'true', lock creation will be disabled." Otherwise, I think there was a read-only Directory hack: http://www.mail-archive.com/[EMAIL PROTECTED]/msg05148.html HTH, sv On Mon, 23 Aug 2004, Simon mcIlwaine wrote: Thanks Jon, that works by putting the jar file in the archive attribute. Now I'm getting the disable-lock error because of the unsigned applet. Do I just comment out the code wherever System.getProperty() appears in the files you specified and then update the JAR archive? Is it possible you could show me one of the hacked files so that I know what I'm modifying? Does anyone else know a way of doing this without having to hack the source code? Many thanks. Simon - Original Message - From: Jon Schuster To: Lucene Users List Sent: Saturday, August 21, 2004 2:08 AM Subject: Re: Lucene Search Applet I have Lucene working in an applet, and I've seen this problem only when the jar file really was not available (a typo in the jar name), which is what you'd expect. It's possible that the classpath for your application is not the same as the classpath for the applet; perhaps they're using different VMs or JREs from different locations. Try referencing the Lucene jar file in the archive attribute of the applet tag. Also, to get Lucene to work from an unsigned applet, I had to modify a few classes that call System.getProperty(), because the properties being requested were disallowed for applets. I think the classes were IndexWriter, FSDirectory, and BooleanQuery. --Jon On Aug 20, 2004, at 6:57 AM, Simon mcIlwaine wrote: I'm a new Lucene user and I'm not too familiar with applets either, but I've been doing a bit of testing on Java applet security, and if I'm correct in saying that applets can read anything below their codebase, then my problem is not a security-restriction one.
The error reads java.lang.NoClassDefFoundError, and the classpath is set, as I have it working in a Swing app. Does someone actually have Lucene working in an applet? Can it be done? Please help. Thanks, Simon - Original Message - From: Terry Steichen To: Lucene Users List Sent: Wednesday, August 18, 2004 4:17 PM Subject: Re: Lucene Search Applet I suspect it has to do with the security restrictions of the applet, 'cause it doesn't appear to be finding your Lucene jar file. Also, regarding the lock files, I believe you can disable the locking stuff just for purposes like yours (a read-only index). Regards, Terry - Original Message - From: Simon mcIlwaine To: Lucene Users List Sent: Wednesday, August 18, 2004 11:03 AM Subject: Lucene Search Applet I'm developing a Lucene CD-ROM based search which will search HTML pages on CD-ROM, using an applet as the UI. I know that there's a problem with lock files and also security restrictions on applets, so I am using the RAMDirectory. I have it working in a Swing application; however, when I put it into an applet it gives me problems. It compiles, but when I go to run the applet I get the error below. Can anyone help? Thanks in advance.
Simon

Error:
java.lang.NoClassDefFoundError: org/apache/lucene/store/Directory
at java.lang.Class.getDeclaredConstructors0(Native Method)
at java.lang.Class.privateGetDeclaredConstructors(Class.java:1610)
at java.lang.Class.getConstructor0(Class.java:1922)
at java.lang.Class.newInstance0(Class.java:278)
at java.lang.Class.newInstance(Class.java:261)
at sun.applet.AppletPanel.createApplet(AppletPanel.java:617)
at sun.applet.AppletPanel.runLoader(AppletPanel.java:546)
at sun.applet.AppletPanel.run(AppletPanel.java:298)
at java.lang.Thread.run(Thread.java:534)

Code:

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import java.awt.*;
import java.awt.event.*;
import javax.swing.*;
import java.io.*;

public class MemorialApp2 extends JApplet implements ActionListener {
    JLabel prompt;
    JTextField input;
    JButton search;
    JPanel panel;
    String indexDir = "C:/Java/lucene/index-list";
    private static RAMDirectory idx;

    public void init() {
        Container cp = getContentPane();
        panel = new JPanel();
Re: Lucene Search Applet
I haven't used it, and I'm a little confused by the code:

/** ...
 * <p>If the system property 'disableLuceneLocks' has the String value of
 * 'true', lock creation will be disabled.
 */
public final class FSDirectory extends Directory {
    private static final boolean DISABLE_LOCKS =
        Boolean.getBoolean("disableLuceneLocks") || Constants.JAVA_1_1;
    ...

I don't see a System.getProperty(String). You might have to patch this, if I'm correct. This should stop the Directory from trying to use locks. HTH, sv

On Mon, 23 Aug 2004, Simon mcIlwaine wrote: Hi Stephane, A bit of a stupid question, but how do I set the system property disableLuceneLocks=true? Can I do it through the FSDirectory API, or do I have to actually hack the code? Also, if I do use RODirectory, how do I go about using it? Do I have to update the Lucene JAR archive with the RODirectory class included? I tried using it and it's not recognising the class. Many thanks,
Simon
Re: Lucene Search Applet
Thanks Erik for correcting me; I feel a bit stupid: I actually looked at the API to make sure I wasn't in left field, but I trusted common sense and stopped at the constructor ;) Should this property be changed in the next major release of Lucene to org.apache...disableLuceneLocks? sv On Mon, 23 Aug 2004, Erik Hatcher wrote: On Aug 23, 2004, at 10:48 AM, Stephane James Vaucher wrote: I haven't used it, and I'm a little confused by the code: private static final boolean DISABLE_LOCKS = Boolean.getBoolean("disableLuceneLocks") || Constants.JAVA_1_1; ... I don't see a System.getProperty(String). :) Check the javadocs for Boolean.getBoolean(). It's by far one of the dumbest and most confusing APIs ever! (Basically this does a System.getProperty("disableLuceneLocks") and converts it to a boolean.) Erik
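Erik's point is easy to trip over, so here is a tiny self-contained demo of the semantics FSDirectory relies on. The class name is made up; Boolean.getBoolean itself is the real JDK API:

```java
// Demonstrates the JDK call FSDirectory uses for its lock switch.
// Boolean.getBoolean(name) does NOT parse its argument as a boolean:
// it looks up the *system property* named 'name' and returns true only
// if that property exists with the string value "true".
public class LockPropertyDemo {

    // Returns {value before setting the property, value after}.
    static boolean[] demo() {
        boolean before = Boolean.getBoolean("disableLuceneLocks");
        System.setProperty("disableLuceneLocks", "true");
        boolean after = Boolean.getBoolean("disableLuceneLocks");
        return new boolean[] { before, after };
    }

    public static void main(String[] args) {
        boolean[] result = demo();
        System.out.println("before: " + result[0] + ", after: " + result[1]);
    }
}
```

In a normal JVM you would enable the switch with java -DdisableLuceneLocks=true. From an unsigned applet, the System.getProperty call underneath can throw a SecurityException, which is why the thread above talks about patching the classes instead.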
Re: Index Size
Stupid question: are you sure you have the right number of docs in your index? i.e. you're not adding the same document twice, directly or via your tmp index. sv On Thu, 19 Aug 2004, Rob Jose wrote: Paul, Thank you for your response. I have appended the field structure that I am using to the bottom of this message; I hope that helps. I am using the StandardAnalyzer. I do not believe that I am changing any default values, but I have also appended the code that adds the temp index to the production index. Thanks for your help. Rob

Here is the code that describes the field structure:

public static Document Document(String contents, String path, Date modified,
        String runDate, String totalpages, String pagecount, String countycode,
        String reportnum, String reportdescr) {
    SimpleDateFormat showFormat =
        new SimpleDateFormat(TurbineResources.getString("date.default.format"));
    SimpleDateFormat searchFormat = new SimpleDateFormat("MMdd");
    Document doc = new Document();
    doc.add(Field.Keyword("path", path));
    doc.add(Field.Keyword("modified", showFormat.format(modified)));
    doc.add(Field.UnStored("searchDate", searchFormat.format(modified)));
    doc.add(Field.Keyword("runDate", runDate == null ? "" : runDate));
    doc.add(Field.UnStored("searchRunDate", runDate == null ? "" :
        runDate.substring(6) + runDate.substring(0, 2) + runDate.substring(3, 5)));
    doc.add(Field.Keyword("reportnum", reportnum));
    doc.add(Field.Text("reportdescr", reportdescr));
    doc.add(Field.UnStored("cntycode", countycode));
    doc.add(Field.Keyword("totalpages", totalpages));
    doc.add(Field.Keyword("page", pagecount));
    doc.add(Field.UnStored("contents", contents));
    return doc;
}

Here is the code that adds the temp index to the production index.
File tempFile = new File(sIndex + File.separatorChar + "temp" + sCntyCode);
tempReader = IndexReader.open(tempFile);
try {
    boolean createIndex = false;
    File f = new File(sIndex + File.separatorChar + sCntyCode);
    if (!f.exists()) {
        createIndex = true;
    }
    prodWriter = new IndexWriter(sIndex + File.separatorChar + sCntyCode,
        new StandardAnalyzer(), createIndex);
} catch (Exception e) {
    IndexReader.unlock(FSDirectory.getDirectory(
        sIndex + File.separatorChar + sCntyCode, false));
    CasesReports.log("Tried to Unlock " + sIndex);
    prodWriter = new IndexWriter(sIndex, new StandardAnalyzer(), false);
    CasesReports.log("Successfully Unlocked " + sIndex + File.separatorChar + sCntyCode);
}
prodWriter.setUseCompoundFile(true);
prodWriter.addIndexes(new IndexReader[] { tempReader });

- Original Message - From: Paul Elschot To: [EMAIL PROTECTED] Sent: Thursday, August 19, 2004 12:16 AM Subject: Re: Index Size On Wednesday 18 August 2004 22:44, Rob Jose wrote: Hello, I have indexed several thousand (52 to be exact) text files and I keep running out of disk space to store the indexes. The size of the documents I have indexed is around 2.5 GB. The size of the Lucene indexes is around 287 GB. Does this seem correct? I am not storing the contents of the ... As noted, one would expect the index size to be about 35% of the original text, i.e. about 2.5 GB * 35% = 800 MB. That is two orders of magnitude off from what you have. Could you provide some more information about the field structure, i.e. how many fields, which fields are stored, which fields are indexed, possibly the use of non-standard analyzers, and possibly non-standard Lucene settings? You might also try changing to the non-compound format to look at the sizes of the individual index files; see the file formats page on the Lucene web site. You can then see the total disk size of, for example, the stored fields.
Regards, Paul Elschot
Re: Index Size
From Doug Cutting: http://www.mail-archive.com/[EMAIL PROTECTED]/msg08757.html "An index typically requires around 35% of the plain text size." I think yours is a little big. sv On Wed, 18 Aug 2004, Rob Jose wrote: Hello, I have indexed several thousand (52 to be exact) text files and I keep running out of disk space to store the indexes. The size of the documents I have indexed is around 2.5 GB; the size of the Lucene indexes is around 287 GB. Does this seem correct? I am not storing the contents of the file, just indexing and tokenizing. I am using Lucene 1.3 final. Can you guys let me know what you are experiencing? I don't want to go into production with something that I should be configuring better. I am not sure if this helps, but I have a temp index and a real index. I index the file into the temp index, and then merge the temp index into the real index using the addIndexes method on the IndexWriter. I have also set setUseCompoundFile to true on the production writer; I did not set this on the temp index. The last thing I do before closing the production writer is call the optimize method. I would really appreciate any ideas to get the index size smaller if it is at all possible. Thanks, Rob
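Doug's 35% rule of thumb makes the mismatch easy to quantify; a quick check with the numbers from this thread (class and method names are made up for the arithmetic):

```java
// Worked arithmetic for the ~35% index-size rule of thumb quoted above.
public class IndexSizeCheck {

    // Expected index size in GB for a given amount of plain text, under the rule.
    static double expectedGB(double plainTextGB) {
        return plainTextGB * 0.35;
    }

    public static void main(String[] args) {
        double expected = expectedGB(2.5);     // 2.5 GB of text -> ~0.875 GB expected
        System.out.println(expected);
        System.out.println(287.0 / expected);  // how many times larger the actual 287 GB is
    }
}
```

287 GB against an expected ~0.9 GB is a factor of a few hundred, which matches Paul's "two orders of magnitude off" remark and points at something structural (e.g. repeated additions) rather than tuning.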
Re: Swapping Indexes?
On Tue, 17 Aug 2004, Patrick Burleson wrote: Forward back to list. -- Forwarded message -- From: Patrick Burleson Date: Tue, 17 Aug 2004 11:30:19 -0400 Subject: Re: Swapping Indexes? To: Stephane James Vaucher Stephane, Thank you for the ideas. I'm going about implementing idea 1 (I like the idea of leaving the temp index around for recovery), but I have a question regarding your original index. Do you just copy over the temp index and not worry about cleaning up the old index directory? Actually, I use an IndexWriter in overwrite mode on the master dir and merge the temp dir. This cleans up the old master. Right now I have my code deleting the files in the main index directory after telling the search controller to switch to the temp index. But by doing that, I need to manage existing searches and not break them while they are running. I also still run into the open-files problem on Windows when trying to delete a file one of the searchers has open before it's closed. I used to wait some time (~1 minute) for all searches on the old master to finish after redirecting to the temp dir; then I would switch to the new master. Thoughts? If you apply a lease-like contract with your searchers, where they borrow a reference to a searcher and then hand it back to the manager, you can probably trace your open files. HTH, sv Patrick On Mon, 16 Aug 2004 18:22:20 -0400 (EDT), Stephane James Vaucher wrote: I've tried two options that seem to work: 1) Have a singleton that is responsible for controlling your searchers. This controller can temporarily redirect your searchers to c:/temp/myindex, allowing you to copy your index to c:/myindex. After that process completes, your controller can tell your searchers to use c:/myindex, allowing you to then erase your temp index.
If you index nightly, you can always *not* erase your tmp dir; your index process will do this automatically if you create your IndexWriter with the overwrite option. This way, you have a backup index if there is a system failure at some point (like when you copy/move directories). 2) Use an incremental index. Regularly, I scan my files, see if there are modifications/additions, and update my master index: removing from the master index, adding to a temp dir, then merging. I haven't seen any weirdness on Windows with this process. HTH, sv On Mon, 16 Aug 2004, Patrick Burleson wrote: I've read in the docs about updating an index and its suggestion regarding swapping out indexes with a directory rename. Here's my question: how to do this when searches are running live?
Re: Swapping Indexes?
On Tue, 17 Aug 2004, Patrick Burleson wrote: On Tue, 17 Aug 2004 13:17:10 -0400 (EDT), Stephane James Vaucher wrote: Actually, I use an IndexWriter in overwrite mode on the master dir and merge the temp dir. This cleans up the old master. I'm a bit of a Lucene newbie here, and I am trying to understand what you mean by "merge the temp dir"? IndexWriter.addIndexes() Do you copy your existing index to the temp location, then use the overwrite feature of IndexWriter to re-create the master? Then what do you merge? Shouldn't the master index now have everything? What I mean is the following: 1) create tmp dir 2) redirect searchers to tmp dir 3) wait for everyone to use tmp dir (or another mechanism) 4) open an IndexWriter on the master dir, erasing it 5) merge the tmp directory, using the addIndexes() method 6) redirect searchers to the new master dir I used to wait some time (~1 minute) for all searches on the old master to finish after redirecting to the temp dir; then I would switch to the new master. I'm going to make this a setting, so that tests won't have to wait a whole minute. But I think this is the cleanest solution without having to implement some sort of leasing solution. Our searches should be fast, and 1 minute is a long time; they should all be done by then. I used to reindex all my docs at 5:00 AM; I probably could have waited 10 minutes since I didn't have users. It's all about requirements ;) Thanks again, Patrick sv
Re: Swapping Indexes?
I've tried two options that seem to work: 1) Have a singleton that is responsible for controlling your searchers. This controller can temporarily redirect your searchers to c:/temp/myindex, allowing you to copy your index to c:/myindex. After that process completes, your controller can tell your searchers to use c:/myindex, allowing you to then erase your temp index. If you index nightly, you can always *not* erase your tmp dir; your index process will do this automatically if you create your IndexWriter with the overwrite option. This way, you have a backup index if there is a system failure at some point (like when you copy/move directories). 2) Use an incremental index. Regularly, I scan my files, see if there are modifications/additions, and update my master index: removing from the master index, adding to a temp dir, then merging. I haven't seen any weirdness on Windows with this process. HTH, sv On Mon, 16 Aug 2004, Patrick Burleson wrote: I've read in the docs about updating an index and its suggestion regarding swapping out indexes with a directory rename. Here's my question: how to do this when searches are running live? Say I have a directory that holds the current valid index: C:\myindex, and when I'm running my nightly process to generate the index, it gets temporarily indexed to: C:\temp\myindex. How can I very quickly replace C:\myindex with C:\temp\myindex? I can't simply do a rename since C:\myindex will likely have open files (gotta love Windows). And I can't delete all files in myindex, again because of the open-files issue. Any ideas? Thanks, Patrick
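The controller idea in option 1 boils down to searchers never caching a directory path of their own. A minimal sketch of that redirect contract (names are hypothetical; the real Lucene open/close calls are omitted):

```java
// Hypothetical searcher controller for the swap scheme above: searchers
// ask the controller for the current index directory on every search, so
// redirecting them is a single atomic assignment to a volatile field.
public class IndexController {

    private volatile String currentIndexDir;

    public IndexController(String initialDir) {
        this.currentIndexDir = initialDir;
    }

    // Searchers "borrow" the current location instead of holding on to it.
    public String acquire() {
        return currentIndexDir;
    }

    // Called after the temp index is built, or after the master is rebuilt.
    public void redirect(String newDir) {
        this.currentIndexDir = newDir;
    }

    public static void main(String[] args) {
        IndexController controller = new IndexController("C:/myindex");
        System.out.println(controller.acquire());
        controller.redirect("C:/temp/myindex");  // point searchers at the temp copy
        System.out.println(controller.acquire());
    }
}
```

A full lease-like version, as suggested in the thread, would additionally count borrowed searchers per directory so the old master is only deleted once every lease has been handed back.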
RE: boost keywords
Other indexing strategies: - AFAIK, you could probably cheat by multiplying the number of tokens in headers thus affecting the scoring. For example: h1hello world/h1 p foo bar /p content - hello world hello world foo bar This is not very tweekable though. - As Tate suggests, you can also use multiple fields and apply your search on all of them: h1hello world/h1 p foo bar /p content- hello world foo bar headers- hello world or even h1hello world/h1 h2 foo bar /h2 content- hello world foo bar header1- hello world header2- foo bar The result of this is that you can fine-grained control over different fields. At this point, you can boost at indexing or at search time. I personnaly opt for search time because it is more open for tweeking as oposed to reindexing everything whenever you want to change a boost factor. As for the complexities that Tate mentions for query parsing, he's right that it's a pain when using the built-in query parser, but you can always use the api directly to build whatever queries you need. HTH, sv On Fri, 13 Aug 2004, Tate Avery wrote: Well, as far as I know you can boost 3 different things: - Field - Document - Query So, I think you need to craft a solution using one of those. Here are some possibilities for each: 1) Field - make a keyword field which is alongside your content field - boost your keyword field during indexing - expand user queries to search 'content' and 'keywords' 2) Document - I don't really think this one helps you in anyway 3) Query - Scan a user query and selectively boost words that are known keywords - This requires a keyword list and is not really scalable That is all that comes to mind, at first glance. So, IMO, the winner IS #1. For example: Field _headline = Field.Text(headline, ...); _headline.setBoost(3); Field _content = Field.Text(content, ...); _document.addField(_headline); _document.addField(_content); But, the tricky part is modifying queries to use both fields. If a user enters virus, it is easy (i.e. 
content:(virus) OR headline:(virus)). But, it quickly gets more complex with more complex queries (especially boolean queries with AND and such... you probably would need something roughly like this: a AND b = content:(a AND b) OR headline:(a AND b) OR (content:a AND headline:b) OR (headline:a AND content:b), and so on). That's my 2 cents. T

-Original Message- From: news [mailto:[EMAIL PROTECTED]] On Behalf Of Leos Literak Sent: Friday, August 13, 2004 8:52 AM To: [EMAIL PROTECTED] Subject: Re: boost keywords

Gerard Sychay wrote: Well, there is always the Lucene wiki. There's not a patterns page per se, but you could start one.

of course I could. If I had something to add :-) but back to my issue. no reaction? So many people using Lucene and no one knows? I would be grateful for any advice. Thanks Leos

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
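Tate's single-term expansion above can be sketched as a naive string rewrite. The class and method names below are illustrative, not from the thread, and this simple form breaks down for boolean queries, which is exactly the complexity Tate describes; building BooleanQuery objects via the API (as sv suggests) is the safer route for anything non-trivial.

```java
// Naive sketch of expanding a one-term user query across the two fields
// from Tate's example. Only safe for a single bare term: anything with
// AND/OR/NOT needs the distribution Tate sketches, so real code would
// construct the query objects directly instead of strings.
public class FieldExpander {
    public static String expand(String term) {
        return "content:(" + term + ") OR headline:(" + term + ")";
    }
}
```

Usage: `FieldExpander.expand("virus")` yields the query string from Tate's example.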
Re: search exception in servlet! Please help me
What is the exception? Is hits null or the index (i) out of bounds? sv

On Tue, 3 Aug 2004, xuemei li wrote: Hi all, I am using Lucene to search. When I use the console to run my code it works fine. But after I put my code into a servlet, it throws an exception. Here is the code raising the exception:

Document doc = hits.doc(i); // <-- exception

But I can use the following code to get the hits.length() value:

out.println("<center><p>There are " + hits.length() + " matches for the word you have entered!</p></center>");

What's the problem? Any reply will be appreciated. thanks, Xuemei Li
Re: Setting up the index directory on tomcat
Assuming you are using an FSDirectory and have the appropriate permissions, yup. sv

On Thu, 29 Jul 2004, Ian McDonnell wrote: Is this done simply by saying:

String indexDirectory = "path of directory you want the index to be stored in";

Ian
Re: continuous index update
I don't know if this helps, but this is what I do. I believe this is correct, but I have just finished the implementation and haven't tested it fully:

- keep a reference to a valid searcher
- open a reader on the old index
- open a writer to a tmp Directory (RAM or FS)
- find removed/modified files, remove them from the reader and add them (if updated, or new documents) to the writer
- close the reader, open a writer on the master dir, merge with the tmp dir
- update the searcher reference

This seems to work while there are concurrent requests, but I need to be more thorough. HTH, sv

On Wed, 28 Jul 2004, jitender ahuja wrote: Hi all, I am trying to make an automatic index update based on a background thread, but it gives errors in deleting the existing index if (only if) the server accesses the index at the same time or has once accessed it, and even if a different request is posed, i.e. for a different index directory or a different job, it makes no difference. Can anyone tell me how, in such a continuous update scenario, the old index can be updated, as I feel deletion of the earlier contents is a must to get the new contents in place. Regards, Jitender
Re: When does IndexReader pick up changes?
IIRC, if you use a searcher, changes are picked up right away. With a reader, I would expect it to react the same way. <disclaimer>I'm not a Lucene guru, I might be wrong</disclaimer> Where I'm less sure is with an FSDirectory, as it uses an internal RAMDirectory. If two separate processes use different FSDirectories (within the same classloader, FS directories with the same paths are reused), you might notice a flushing behaviour. sv

On 28 Jul 2004 [EMAIL PROTECTED] wrote: Hi, Does anyone know if the IndexWriter has to be closed for an IndexReader to pick up the changes? Thanks. --- Lucene Users List [EMAIL PROTECTED] wrote: Hi, If I do this:

- open index writer
- add document
- open reader
- search with reader
- close reader
- close writer

Will the reader pick up the document that was added to the index, since it was opened after the document was added? Or will it only pick up changes that occur after the index writer is closed? Thanks for the help!
Re: Memory Requirements
On Thu, 13 May 2004, Matt Quail wrote: do you know of any method to reduce the memory consumption of lucene when searching?

It depends on the complexity of the search, I think. Also, I believe scoring might use more memory than the search itself (can anyone confirm this?). For example, I often use the HitCollector interface (and a BitSet) for queries where I am not interested in the score. Apart from that, I'm not aware of any other methods for reducing memory consumption. Avoid prefix queries and wildcards, since they can be rewritten into large boolean queries. You can limit the rewriting with a maximum number of clauses in a BooleanQuery to prevent requests from taking too much memory.

=Matt

Sascha Ottolski wrote: On Thursday, 13 May 2004 12:56, Matt Quail wrote: I noticed that most users have +- 1G of RAM to run Lucene. Does anyone have experience running it on a 128MB or 256MB machine?

I regularly test my app that uses Lucene by passing -Xmx8m to the JVM; this is on a box with 1G of RAM, but the JVM never uses more than 8M. My app runs fine (though there is a little more garbage collection activity).

do you know of any method to reduce the memory consumption of lucene when searching? I've just increased from -Xmx400m to -Xmx500m, since sometimes OutOfMemoryExceptions occurred (running Sun's Java 1.4.2_04 and lucene-1.4rc3 on an index with ca. 18 million entries, building an index (including stored-only fields) of 9 GB). I've seen that there are ways to limit the memory needed for indexing and optimizing(?) by reducing the MergeFactor, but that doesn't seem to apply to searching :-( Thanks, Sascha

=Matt sv
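The HitCollector-plus-BitSet idea above can be sketched as follows. The class is a self-contained stand-in (its name is illustrative): the point is that each match costs one bit instead of a scored Hits entry. The `collect(int doc, float score)` signature mirrors, from memory, the Lucene 1.4 HitCollector callback that a searcher would invoke per matching document.

```java
import java.util.BitSet;

// Sketch of the BitSet-backed collection pattern mentioned in the post:
// matches are recorded as single bits, so no score sorting and no per-hit
// objects. In Lucene 1.4 this logic would live inside an anonymous
// HitCollector passed to IndexSearcher.search(query, collector).
public class BitSetCollector {
    public final BitSet bits;

    public BitSetCollector(int maxDoc) {
        bits = new BitSet(maxDoc); // one bit per document id
    }

    // same shape as HitCollector.collect(int doc, float score)
    public void collect(int doc, float score) {
        bits.set(doc); // the score is deliberately ignored
    }
}
```

Afterwards, `bits.cardinality()` gives the hit count and iterating set bits gives the matching document ids.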
Analysis of wildcard queries
I've seen this: http://www.jguru.com/faq/view.jsp?EID=538312 I've seen in the code that there is a method to set lowercasing, but I need to remove accented chars as well. Any suggestions as to which is preferable: preprocessing the input, or subclassing the QueryParser and redefining getWildcardQuery? cheers, sv
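For the preprocessing route, one hedged sketch: decompose the input, strip the combining marks, then lowercase. The class name is illustrative, and `java.text.Normalizer` is a later-JDK convenience; a 2004-era equivalent would have been a hand-rolled character table.

```java
import java.text.Normalizer;

// Sketch of input preprocessing for wildcard terms: NFD decomposition
// splits each accented character into a base letter plus combining marks,
// the \p{M} regex class then deletes the marks, and the remainder is
// lowercased to match a lowercasing analyzer.
public class AccentFolder {
    public static String fold(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "").toLowerCase();
    }
}
```

The same folding must be applied at index time (in the analyzer) or the preprocessed wildcard terms won't match the indexed tokens.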
Re: Range searches for numbers
Quick reference: http://wiki.apache.org/jakarta-lucene/SearchNumericalFields If you are stuck, you can always encode the long in a string format (the date formatter in Lucene might do this already). Or you could even treat it like a date and use your long in a date filter. HTH, sv

On 6 May 2004 [EMAIL PROTECTED] wrote: Hi, What's the best way to store numbers for range searching? If someone has some info about this I'd love to see it. This is my current plan: when I convert the number to a string I will zero-pad it so range searches work. The conversions will be like this for integers:

1 to 101
2 to 102
1000 to 1001000

I'm just adding a 1 to the start of the string (or adding 10). This is so negative numbers work too! They will just be subtracted from a long (10):

-1 to 099
-2 to 098
-1000 to 0999000

This works great for range searches. But how do I convert negative longs? I can't subtract 100 from a long, can I? It's too big to fit in another long. Any advice is appreciated! -Reece
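On Reece's negative-long question: rather than subtracting from a long (which overflows), one sketch is to shift the value through BigInteger into the unsigned 64-bit range and zero-pad to a fixed 20 digits, so lexicographic string order matches numeric order. Class and method names are illustrative.

```java
import java.math.BigInteger;

// Encode a signed long as a fixed-width decimal string whose natural
// (lexicographic) order matches numeric order. Adding 2^63 maps
// Long.MIN_VALUE..Long.MAX_VALUE onto 0..2^64-1; BigInteger sidesteps the
// overflow that plain long arithmetic would hit on positive inputs.
public class SortableLong {
    private static final BigInteger OFFSET = BigInteger.ONE.shiftLeft(63); // 2^63

    public static String encode(long value) {
        String digits = BigInteger.valueOf(value).add(OFFSET).toString();
        StringBuilder padded = new StringBuilder();
        for (int i = digits.length(); i < 20; i++) {
            padded.append('0'); // the max unsigned 64-bit value has 20 digits
        }
        return padded.append(digits).toString();
    }
}
```

With this, `encode(-1)` sorts just below `encode(0)`, and a Lucene range query over the encoded field behaves numerically.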
Re: Understanding Boolean Queries
On Thu, 29 Apr 2004, Tate Avery wrote: Hello, I have been reviewing some of the code related to boolean queries and I wanted to see if my understanding is approximately correct regarding how they are handled and, more importantly, the limitations.

You can always submit requests for enhancement in Bugzilla, so as to keep track of this issue.

Here is what I have come to understand so far: 1) The QueryParser code generated from javacc will parse my boolean query and determine for each clause whether or not it is 'required' (based on a few conditions but, in short, whether or not it was introduced or followed by 'AND') or 'prohibited' (based, in short, on it being preceded by 'NOT').

Your usage seems pretty particular, why are you using the javacc QueryParser?

2) As my BooleanQuery is being constructed, it will throw a BooleanQuery.TooManyClauses exception if I exceed BooleanQuery.maxClauseCount (which defaults to 1024).

It's configurable through sys properties or by BooleanQuery.setMaxClauseCount(int maxClauseCount).

3) The maxClauseCount threshold appears not to care whether or not my clauses are 'required' or 'prohibited'... only how many of them there are in total. 4) My BooleanQuery will prepare its own Scorer instance (i.e. BooleanScorer). And, during this step, it will identify to the scorer which clauses are 'required' or 'prohibited'. And, if more than 32 fall into this category, an IndexOutOfBoundsException (More than 32 required/prohibited clauses in query.) is thrown. That's as far as I got. Now, I am a bit confused at this point. Does this mean I can make a boolean query consisting of up to 1024 clauses as long as no more than 32 of them are required or prohibited? This doesn't seem right. So, am I missing something in the way I am understanding this? I am (as you may have guessed) generating large boolean queries. And, in some rare cases, I am receiving the exception identified in #4 (above).
So, I am trying to figure out whether or not I need to change/filter my queries in a special way in order to avoid this exception. And, in order to do this, I want to understand how these queries are being handled. Finally, is there something related to the query syntax that could be my mistake? For example, what is the difference between: A B AND C D AND D E ... and... (A B) AND (C D) AND (D E) ... could that be the crux of it?

I can't help you here, and the doc seems rather thin (or nonexistent for this class). I don't know the relation between the query and how the scorer will process it. Sorry I can't be of more assistance, sv

Thank you for your time, Tate Avery
RE: Understanding Boolean Queries
Hi Tate, Forgot to ask, what version of Lucene? (IIRC, <= 1.2 means no maxClauseCount) sv

On Thu, 29 Apr 2004, Tate Avery wrote: Thank you for the response. I am not using the QueryParser directly... it was just part of my overall understanding of how this exception is coming about. Same thing, essentially, with the maxClauseCount. Here is some code to illustrate what is confusing me and what I am trying to ascertain:

int _numClauses = XXX;
boolean _required = XXX; // 3 examples of these var settings below
BooleanQuery _query = new BooleanQuery();
for (int _i = 0; _i < _numClauses; _i++) {
    _query.add(new BooleanClause(new TermQuery(new Term("body", "term" + _i)), _required, false));
}
Hits _hits = new IndexSearcher(INDEX_DIR).search(_query);

1) With _numClauses= and _required=false (for example), I have no problems. (This is confusing since it is more than maxClauseCount... but I won't complain.) 2) With _numClauses=32 and _required=true, I also have no problems. 3) With _numClauses=33 and _required=true, I get java.lang.IndexOutOfBoundsException: More than 32 required/prohibited clauses in query. as a runtime exception. So, I guess I am trying to ask the following: Is a query like (T1 AND T2 AND ... AND T32 AND T33) just completely illegal for Lucene? OR is there some way to extend this limit? OR am I missing something that is clouding my understanding? Thanks, Tate

-Original Message- From: Stephane James Vaucher [mailto:[EMAIL PROTECTED]] Sent: Thursday, April 29, 2004 1:10 PM To: Lucene Users List; [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Subject: Re: Understanding Boolean Queries

On Thu, 29 Apr 2004, Tate Avery wrote: Hello, I have been reviewing some of the code related to boolean queries and I wanted to see if my understanding is approximately correct regarding how they are handled and, more importantly, the limitations.

You can always submit requests for enhancement in Bugzilla, so as to keep track of this issue.
Here is what I have come to understand so far: 1) The QueryParser code generated from javacc will parse my boolean query and determine for each clause whether or not it is 'required' (based on a few conditions but, in short, whether or not it was introduced or followed by 'AND') or 'prohibited' (based, in short, on it being preceded by 'NOT').

Your usage seems pretty particular, why are you using the javacc QueryParser?

2) As my BooleanQuery is being constructed, it will throw a BooleanQuery.TooManyClauses exception if I exceed BooleanQuery.maxClauseCount (which defaults to 1024).

It's configurable through sys properties or by BooleanQuery.setMaxClauseCount(int maxClauseCount).

3) The maxClauseCount threshold appears not to care whether or not my clauses are 'required' or 'prohibited'... only how many of them there are in total. 4) My BooleanQuery will prepare its own Scorer instance (i.e. BooleanScorer). And, during this step, it will identify to the scorer which clauses are 'required' or 'prohibited'. And, if more than 32 fall into this category, an IndexOutOfBoundsException (More than 32 required/prohibited clauses in query.) is thrown. That's as far as I got. Now, I am a bit confused at this point. Does this mean I can make a boolean query consisting of up to 1024 clauses as long as no more than 32 of them are required or prohibited? This doesn't seem right. So, am I missing something in the way I am understanding this? I am (as you may have guessed) generating large boolean queries. And, in some rare cases, I am receiving the exception identified in #4 (above). So, I am trying to figure out whether or not I need to change/filter my queries in a special way in order to avoid this exception. And, in order to do this, I want to understand how these queries are being handled. Finally, is there something related to the query syntax that could be my mistake? For example, what is the difference between: A B AND C D AND D E ... and... (A B) AND (C D) AND (D E) ...
could that be the crux of it?

I can't help you here, and the doc seems rather thin (or nonexistent for this class). I don't know the relation between the query and how the scorer will process it. Sorry I can't be of more assistance, sv

Thank you for your time, Tate Avery
Re: Combining text search + relational search
I'm a bit confused why you want this. As far as I know, relational db searches will return exact matches without a measure of relevancy. To measure relevancy, you need a search engine. For your results to be coherent, you would have to put everything in the Lucene index. As for memory consumption: for searching, if the index is on disk, the memory footprint depends on the type of queries you use. For indexing, it depends on whether you use a tmp RAMDirectory to do merges; otherwise, memory consumption is minimal. HTH sv

On Wed, 28 Apr 2004 [EMAIL PROTECTED] wrote: I need to somehow allow users to do a text search and query relational database attributes at the same time. The attributes are basically metadata about the documents that the text search will be performed on. I have the text of the documents indexed in Lucene. Does anyone have any advice or examples? I also need to make sure I don't gobble up all the memory on our server. Thanks Mike
Re: status of LARM project
I suggest you look at: http://www.manageability.org/blog/stuff/open-source-web-crawlers-java From what I know of Nutch, it's meant as the basis for a competitor to the big search engines (i.e. Google). For a small web site, it might be overkill, especially if it requires you to build from CVS (unless there are distributions). Note: I've got the book Programming Spiders, Bots and Aggregators in Java; it describes spiders using a project called j-spider: http://sourceforge.net/projects/j-spider/ It could probably be adapted for your needs. HTH, sv

On Wed, 28 Apr 2004, Kelvin Tan wrote: As far as I know, LARM is defunct. I read somewhere, perhaps apocryphal, that Clemens got a job which wasn't supportive of his continued development on LARM. AFAIK there aren't any other active developers of LARM (at least at the time it branched off to SF). Otis recently posted to use Nutch instead of LARM. Kelvin

On 28 Apr 2004 09:44:04 +0800, Sebastian Ho said: Hi, I have looked at the LARM website and I get different results: http://nagoya.apache.org/wiki/apachewiki.cgi?LuceneLARMPages It says that development has stopped for this project. LARM hosted on SourceForge: the last message in the mailing list was dated 2003. Is it still supported and active? LARM hosted on Apache: it says the project has moved to SourceForge. Anyone here who is active in LARM can comment on the status? Regards Sebastian Ho
RE: Segments file get deleted?!
I would have to agree with Surya's diagnosis. Can you give us details on your update process? Please include the OS, and whether there are some non-Java processes involved (e.g. doing copies). cheers, sv

On Mon, 26 Apr 2004, Nader S. Henein wrote: Can you give us a bit of background? We've been using Lucene since the first stable release 2 years ago, and I've never had segments disappear on me. First of all, can you provide some background on your setup, and secondly, when you say a certain period of time, how much time are we talking about here, and does that interval coincide with your indexing schedule? You may have the create flag on the indexer set to true, so it simply recreates the index at every update and deletes whatever was there; of course, if there are no files to index at any point it will just give you a blank index. Nader Henein

-Original Message- From: Surya Kiran [mailto:[EMAIL PROTECTED]] Sent: Monday, April 26, 2004 7:48 AM To: [EMAIL PROTECTED] Subject: Segments file get deleted?!

Hi all, we have implemented our portal search using Lucene. It works fine, but after a certain period of time the Lucene segments file gets deleted. Eventually all searches fail. Can anyone guess where the error could be? Thanks a lot. Regards Surya.
Re: [Jakarta Lucene Wiki] Updated: NewFrontPage
I don't know what you think of the NewFrontPage, but if you like it, I could do a switch, renaming the old FrontPage to OldFrontPage and the new one to FrontPage. Also, if anyone knows how to do this, it would be appreciated. I haven't figured out yet how to rename/destroy pages (are there permissions in MoinMoin?). Amongst other things, the doc says there should be an action=DeletePage. cheers, sv

On Mon, 26 Apr 2004 [EMAIL PROTECTED] wrote: Date: 2004-04-26T07:38:29 Editor: StephaneVaucher [EMAIL PROTECTED] Wiki: Jakarta Lucene Wiki Page: NewFrontPage URL: http://wiki.apache.org/jakarta-lucene/NewFrontPage Added link to PoweredBy page Change Log: -- @@ -17,6 +17,7 @@ || IntroductionToLucene || Articles and Tutorials introducing Lucene || || OnTheRoad || Information on presentations and courses || || InformationRetrieval || Articles and Tutorials on information retrieval || + || PoweredBy || Link to projects using Lucene || || [LuceneFAQ] || The Lucene FAQ || || HowTo || Lucene HOWTO's: small tutorials and code snippets || || [Resources] || Contains useful links ||
Re: Adding duplicate Fields to Documents
From my experience (that is, little experience ;)), fields that are not tokenised are stored separately. Someone more qualified can surely give you more details. You can look at your index with Luke, it might be insightful. sv

On Thu, 22 Apr 2004, Gerard Sychay wrote: Hello, I am wondering what happens when you add two Fields with the same name to a Document. The API states that if the fields are indexed, their text is treated as though appended. This much makes sense. But what about the following two cases? - Adding two fields with the same name that are indexed but not tokenized (keywords)? E.g. given (field_name, keyword1) and (field_name, keyword2), would the final keyword field be (field_name, keyword1keyword2)? Seems weird... - Adding two fields with the same name that are stored, but not indexed and not tokenized (e.g. database keys)? Are they appended (which would mess up the database key when retrieved from the Hit)? Thanks, Gerard
Re: Searcher not aware of index changes
This is not normal behaviour. Normally, using a new IndexSearcher should reflect the modified state of your index. Could you post a more informative bit of code? sv

On Wed, 21 Apr 2004 [EMAIL PROTECTED] wrote: Hi! My Searcher instance is not aware of changes to the index. I even create a new instance, but it seems only a complete restart helps(?):

indexSearcher = new IndexSearcher(IndexReader.open(index));

Timo
Re: Searcher not aware of index changes
Normally the code should work, if you don't keep references to the old Searcher (and don't try caching it). Make sure you aren't doing this by mistake. For the design of your facade, you could always implement Searchable and do the delegation to the up-to-date instance of IndexSearcher. Quick comment: you should call .close() on your searcher before removing the reference. If this causes exceptions in future searches, it would indicate incorrect caching. HTH, sv

On Wed, 21 Apr 2004 [EMAIL PROTECTED] wrote: On Wednesday 21 April 2004 19:20, Stephane James Vaucher wrote: This is not normal behaviour. Normally using a new IndexSearcher should reflect the modified state of your index. Could you post a more informative bit of code?

BTW, why can't Lucene take care of it itself? Well, according to my logging it does create a new instance. I use only one instance of SearchFacade:

public class SearchFacade extends Observable {

    protected class IndexObserver implements Observer {
        private final Log log = LogFactory.getLog(getClass());
        public Searcher indexSearcher;

        public IndexObserver() {
            newSearcher(); // init
        }

        public void update(Observable o, Object arg) {
            log.debug("Index has changed, creating new Searcher");
            newSearcher();
        }

        private void newSearcher() {
            try {
                indexSearcher = new IndexSearcher(IndexReader.open(Configuration.LuceneIndex.MAIN));
            } catch (IOException e) {
                log.error("Could not instantiate searcher: " + e);
            }
        }

        public Searcher getIndexSearcher() {
            return indexSearcher;
        }
    }

    private IndexObserver indexObserver;

    public SearchFacade() {
        addObserver(indexObserver = new IndexObserver());
    }

    public void createIndex() {
        ...
        setChanged(); // index has changed
        notifyObservers();
    }

    public Hits search(String query) {
        Searcher searcher = indexObserver.getIndexSearcher();
        ...
    }
}
Re: what web crawler work best with Lucene?
How big is the site? I mostly use an in-house solution, but I've used HttpUnit for web scraping small sites (because of its high-level API). Here is a hello world example: http://wiki.apache.org/jakarta-lucene/HttpUnitExample For a small/simple site, small modifications to this class could suffice. It WILL NOT function on large sites because of memory problems. For larger sites, there are questions like:

- memory: For example, spidering all links on every page can lead to visiting too many links. Keeping all visited links in memory can be problematic.
- noise: If you get every page on your web site, you might be adding noise to the search engine. Spider navigation rules can help out, like saying that you should only follow links/index documents of a specific form like www.mysite.com/news/article.jsp?articleid=xxx
- speed: Too much speed can be bad: doing 100 hits/sec on a site could hurt it (especially if it's not you who is the webmaster). Too little speed can be bad if you want to make sure you quickly get new pages.
- categorisation: You might want to separate information in your index. For example, you might want a user to do a search in the documentation section or in the press release section. This categorisation can be done by specifying sections of the site, or a subsequent analysis of available docs.
- up-to-date information: You'll want to think about your update schedule, so that if you add a new page, it gets indexed quickly. This problem also occurs when you modify an existing page; you might want the modification to be detected rapidly.

HTH, sv

On Thu, 22 Apr 2004, Tuan Jean Tee wrote: Has anyone implemented any open source web crawler with Lucene? I have a dynamic website and am looking at putting in a search tool. Your advice is very much appreciated. Thank you.
Using Runtime.exec to extract text [Was: Bridge with OO]
In case you don't know: using Runtime.exec() on Windows, you need to consume the output streams or the application will block. This is not the case on Linux. http://www.javaworld.com/javaworld/jw-12-2000/jw-1229-traps.html In short: Because some native platforms only provide limited buffer size for standard input and output streams, failure to promptly write the input stream or read the output stream of the subprocess may cause the subprocess to block, and even deadlock. HTH, sv

On Tue, 20 Apr 2004, Argyn wrote: I've had the same requirement. I used antiword, xlhtml and ppthtml on win2k. I called them with Runtime.exec(). There are still problems: all three hang up sometimes. Otherwise, it worked. I indexed several hundreds of thousands of files in development mode. I never got into production. Argyn

On Mon, 19 Apr 2004 16:53:41 -0400 (EDT), Stephane James Vaucher [EMAIL PROTECTED] wrote: Actually, the objective would be to use OO to extract text from MSOffice formats. If I read your code correctly, it should only work with OO, as the docs are in XML. Thanks for the code for OO docs though, sv

On Mon, 19 Apr 2004, Mario Ivankovits wrote:
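The usual fix for the blocking problem above is a small "gobbler" thread per output stream, started before waiting for the child process. A sketch (the class name is illustrative): in real use, one instance would be pointed at process.getInputStream() and another at process.getErrorStream() before calling waitFor().

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

// Drains a stream on a background thread so a child process started with
// Runtime.exec() never blocks on a full stdout/stderr buffer. Start one
// gobbler per stream (stdout and stderr) before Process.waitFor().
public class StreamGobbler extends Thread {
    private final InputStream in;
    private final StringBuffer out = new StringBuffer(); // StringBuffer: safe across threads

    public StreamGobbler(InputStream in) {
        this.in = in;
    }

    public void run() {
        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(in));
            String line;
            while ((line = reader.readLine()) != null) {
                out.append(line).append('\n'); // keep the text for indexing
            }
            reader.close();
        } catch (IOException e) {
            // a real indexer would log this and skip the document
        }
    }

    public String output() {
        return out.toString();
    }
}
```

Typical call site (hypothetical command): start gobblers on both streams, `start()` them, then `waitFor()`; read `output()` only after `join()` returns.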
Re: Bridge with OpenOffice
I'll make a copy of the code available on the wiki before it disappears off the Web. Now for some info on using OO on a production system: http://www.oooforum.org/forum/viewtopic.php?t=2913&highlight=jurt

<summary src="Web, not my experience">OO works well (but is slow), but is not multi-threaded (the communication bridge is).</summary>

Quotes from the end of 2003:

Kai Sommerfeld from Sun wrote: Quote: The answer is not a simple 'yes' or 'no'. It's more: 'partly'. There are parts of the OOo API that are threadsafe, others are not. Newer components are generally threadsafe. Components that are mainly wrappers for old Office code are mostly not. A main problem is that we cannot state for sure which components are actually thread safe and which are not. It's as bad as I say it here. We're trying to solve the multithreading issues for one of the next major releases of OOo. But this is definitely not an easy task, especially since rewriting all non-threadaware code is simply not an option because of missing developer resources.

Juergen Schmidt from Sun wrote: Quote: If you want to use OO in a safe way you shouldn't use it multi-threaded. But we want to improve the server functionality of OO in general so that your described scenario should be possible. Sorry, but currently you have to work around this in your own application and you should use OO single-threaded. But as I said, we are working on this feature.

Niklas Nebel from Sun, who seems to have had success with some code running multithreaded, wrote: Quote: The document API functions use the SolarMutex, so you should be able to use them from multiple threads without problems (with one call blocking the next, of course). Listener callbacks might be a problem if handled by different threads, but at least for the spreadsheet API I'm not aware of any other problems.
Don't forget that every API call over a connection to a running office is multi-threaded, as the connection is handled by a different thread from office user interactions. sv

On Mon, 19 Apr 2004, Magnus Johansson wrote: Yes, I have tried it and it seems to work ok. I haven't really used it in a production environment however. There was some code here: http://www.gzlinux.org/docs/category/dev/java/doc2txt.pdf It is, however, not there anymore; the Google HTML version is available at http://66.102.9.104/search?q=cache:549doYEZTD4J:www.gzlinux.org/docs/category/dev/java/doc2txt.pdf+Appending+the+favoured+extension+to+the+origin+document+name&hl=en&ie=UTF-8 /magnus

Anyone tried what Joerg suggested here? http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]msgNo=6231 sv
Re: Bridge with OpenOffice
Actually, the objective would be to use OO to extract text from MS Office formats. If I read your code correctly, it should only work with OO documents, as those docs are in XML. Thanks for the code for OO docs, though. sv On Mon, 19 Apr 2004, Mario Ivankovits wrote: Stephane James Vaucher wrote: Anyone try what Joerg suggested here? http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]msgNo=6231 Don't know what you would like to do, but if you simply would like to extract text, you could simply try this snippet (string literals restored where the archive stripped the quotes):
---snip---
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.util.jar.JarFile;
import java.util.zip.ZipEntry;

import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

JarFile jar = new JarFile(file, false);
ZipEntry entry = jar.getEntry("content.xml");
if (entry == null) {
    throw new IOException("content.xml missing in file: " + file);
}
InputStream is = jar.getInputStream(entry);
XMLReader xr = XMLReaderFactory.createXMLReader("org.apache.crimson.parser.XMLReaderImpl");
// Ignore the OO DTD by resolving it to an empty stream.
xr.setEntityResolver(new EntityResolver() {
    public InputSource resolveEntity(String publicId, String systemId)
            throws SAXException, IOException {
        if (systemId.toLowerCase().endsWith(".dtd")) {
            StringReader stringInput = new StringReader("");
            return new InputSource(stringInput);
        } else {
            return null;
        }
    }
});
final StringBuffer sbText = new StringBuffer(10240);
xr.setContentHandler(new ContentHandler() {
    public void skippedEntity(String name) throws SAXException { }
    public void setDocumentLocator(Locator locator) { }
    public void ignorableWhitespace(char ch[], int start, int length) throws SAXException { }
    public void processingInstruction(String target, String data) throws SAXException { }
    public void startDocument() throws SAXException { }
    public void startElement(String namespaceURI, String localName, String qName,
            Attributes atts) throws SAXException {
        // Start a new line for each OO paragraph element.
        if (qName.equals("text:p")) {
            if (sbText.length() > 0 && sbText.charAt(sbText.length() - 1) != '\n') {
                sbText.append('\n');
            }
        }
    }
    public void endPrefixMapping(String prefix) throws SAXException { }
    public void characters(char ch[], int start, int length) throws SAXException {
        sbText.append(ch, start, length);
    }
    public void endElement(String namespaceURI, String localName, String qName)
            throws SAXException { }
    public void endDocument() throws SAXException { }
    public void startPrefixMapping(String prefix, String uri) throws SAXException { }
});
InputSource source = new InputSource(is);
source.setPublicId("");
source.setSystemId("");
xr.parse(source);
System.err.println("TXT: " + sbText.toString());
---snip---
Ciao, Mario
Bad link contrib page
Since it has been moved to the Sandbox, someone should remove the term highlighter reference. http://jakarta.apache.org/lucene/docs/contributions.html Miscellaneous - Term Highlighter cheers, sv
Bridge with OpenOffice
Anyone try what Joerg suggested here? http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]msgNo=6231 sv
Re: Presentation in Mtl
Erik, I'll add a wiki page tomorrow for Lucene training and presentations, unless you beat me to it. sv BTW, I visited nofluffjuststuff.com, and I've noticed that highlighting is misspelled again ;) Apologies if I'm poking fun at an Americanism. On Thu, 15 Apr 2004, Erik Hatcher wrote: This is great to see! I've been presenting Lucene at the No Fluff, Just Stuff symposiums for a while and really enjoy doing so. I also presented it last year at the O'Reilly Open Source Conference with the toughest attendee possible, Doug Cutting himself. I'm continuing my Lucene presentations this year still on the NFJS tour, and also I will be presenting it at JavaOne. The JavaOne presentation is only one hour though, so it will be a very quick (yet techie) pass through what Lucene offers. Let's create a wiki page that lists all the venues for Lucene presentations and training. If you are in the US and near a city listed here http://www.nofluffjuststuff.com come on out! Erik On Apr 15, 2004, at 1:06 AM, Matt Quail wrote: I too gave a Lucene presentation to my local JUG (Canberra, Australia) last night. It also went over very well. Lucene totally rocks! =Matt Stephane James Vaucher wrote: Hi everyone, I did a presentation tonight in Montreal at a Java users group meeting. I've got to say that there were maybe 4 companies present that use Lucene and find it very useful and simple to use. It led to the longest discussion (a positive one, that is) I've had at the users' group. So I've got to tell the Lucene contributors GOOD JOB! I'll probably upload my ppt presentation (heavily based on existing tutorials) to the wiki, so you can comment on it.
cheers, sv
Presentation in Mtl
Hi everyone, I did a presentation tonight in Montreal at a Java users group meeting. I've got to say that there were maybe 4 companies present that use Lucene and find it very useful and simple to use. It led to the longest discussion (a positive one, that is) I've had at the users' group. So I've got to tell the Lucene contributors GOOD JOB! I'll probably upload my ppt presentation (heavily based on existing tutorials) to the wiki, so you can comment on it. cheers, sv
Re: Presentation in Mtl
Wow, discussing Lucene in French for 2 1/2 hours has affected my English. Please ignore the spelling mistakes ;), but don't ignore the spirit of the message. sv On Thu, 15 Apr 2004, Stephane James Vaucher wrote: Hi everyone, I did a presentation tonight in Montreal at a Java users group meeting. I've got to say that there were maybe 4 companies present that use Lucene and find it very useful and simple to use. It led to the longest discussion (a positive one, that is) I've had at the users' group. So I've got to tell the Lucene contributors GOOD JOB! I'll probably upload my ppt presentation (heavily based on existing tutorials) to the wiki, so you can comment on it. cheers, sv
Re: index update (was Re: Large InputStream.BUFFER_SIZE causes OutOfMemoryError.. FYI)
I'm actually pretty lazy about index updates, and haven't had the need for efficiency, since my requirement is that new documents should be available on a next-working-day basis. I reindex everything from scratch every night (400,000 docs) and store it in a timestamped index. When the reindexing is done, I alert a controller of the new active index. I keep a few versions of the index in case of a failure somewhere, and I can always send a message to the controller to use an old index. cheers, sv On Tue, 13 Apr 2004, petite_abeille wrote: On Apr 13, 2004, at 02:45, Kevin A. Burton wrote: He mentioned that I might be able to squeeze 5-10% out of index merges this way. Talking of which... what strategy(ies) do people use to minimize downtime when updating an index? My current strategy is as follows: (1) use a temporary RAMDirectory for ongoing updates. (2) perform a copy-on-write when flushing the RAMDirectory into the persistent index. The second step means that I create an offline copy of a live index before invoking addIndexes() and then substitute the old index with the new, updated one. While this effectively increases the time it takes to update an index, it nonetheless reduces the *perceived* downtime for it. Thoughts? Alternative strategies? TIA. Cheers, PA.
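The "reindex into a timestamped directory, then tell a controller" scheme above can be sketched without any Lucene calls. This is a minimal illustration, not the poster's actual code: all names here (ActiveIndex, publish, rollback) are hypothetical, and the point is only that searches read the active directory through one atomically swapped reference.

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of the "reindex, then swap" controller described above.
// Searchers ask activeDir() which index to open; the nightly job
// publishes the new timestamped index only once it is complete, and
// old indexes are kept around so a rollback is just another swap.
public class ActiveIndex {
    private final AtomicReference<String> current = new AtomicReference<String>();

    /** Called by the nightly job once the new timestamped index is complete. */
    public void publish(String newIndexDir) {
        current.set(newIndexDir); // searches started after this see the new index
    }

    /** Called on failure, to fall back to a previously kept index. */
    public void rollback(String oldIndexDir) {
        current.set(oldIndexDir);
    }

    public String activeDir() {
        return current.get();
    }
}
```

In-flight searches keep using whatever directory they opened before the swap, which is what keeps the perceived downtime near zero.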
Simple spider demo
I'm wondering if there is interest in a simple spider demo. I've got an example of how to use HttpUnit to spider a web site and index it on disk (HTML pages only for now). I can send it to the list if anyone is interested (it's one class, 200 loc). cheers, sv
Re: ANN: Docco 0.3
Looks cool, but I've got a question: How do you handle symlinks on *nix? I think it's stuck in a loop. When indexing my home dir, I see it indexing: /home/vauchers/.Cirano-gnome/.gnome-desktop/Home directory/.Cirano-gnome/... cheers, sv On Wed, 14 Apr 2004, Peter Becker wrote: Hello, we released Docco 0.3 along with two updates for its plugins. Docco is a personal document retrieval tool based on Apache's Lucene indexing engine and Formal Concept Analysis. It allows you to create an index for files on your file system which you can then search for keywords. It can index plain text, HTML, XML and OpenOffice files and, with the support of plugins, others like PDF, DOC and XLS. This new version of Docco features a number of small enhancements: the diagram layout can be changed, printing and graphic export options have been added and some plugins have been updated. The new POI plugin should be able to index MS Word documents again (the old one broke with recent Java versions), and the PDFbox plugin gets all the recent updates from the PDFbox project. Old plugins will still continue to work, though. You can find the updated files here: http://sourceforge.net/project/showfiles.php?group_id=21448 Note that you can now also use the export plugins to add more graphic export options. Enjoy! Peter
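The symlink loop described above (a link under the home directory pointing back at the home directory) is the classic hazard of a naive recursive walk. One common fix, sketched here with hypothetical names and not taken from Docco's code, is to remember each directory's canonical path: a symlink back into an already-visited directory resolves to the same canonical path and gets skipped.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Walk a directory tree without looping on symlinks by tracking
// canonical paths of visited directories. Sketch only.
public class SafeWalker {
    public static List walk(File dir) {
        List files = new ArrayList();
        walk(dir, new HashSet(), files);
        return files;
    }

    private static void walk(File dir, Set seen, List files) {
        try {
            if (!seen.add(dir.getCanonicalPath())) {
                return; // already visited (possibly via a symlink): stop the cycle
            }
        } catch (IOException e) {
            return; // unresolvable path: skip it
        }
        File[] entries = dir.listFiles();
        if (entries == null) {
            return; // not a directory, or unreadable
        }
        for (int i = 0; i < entries.length; i++) {
            if (entries[i].isDirectory()) {
                walk(entries[i], seen, files);
            } else {
                files.add(entries[i].getPath());
            }
        }
    }
}
```

With this check, /home/vauchers/.Cirano-gnome/.gnome-desktop/Home directory/... canonicalizes back to /home/vauchers and the recursion stops on the second visit.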
Re: Simple spider demo
I've uploaded it to the wiki: http://wiki.apache.org/jakarta-lucene/HttpUnitExample <disclaimer>It's not anywhere close to production quality, especially since it's based on a unit test framework.</disclaimer> sv On Tue, 13 Apr 2004, Stephane James Vaucher wrote: I'm wondering if there is interest in a simple spider demo. I've got an example of how to use HttpUnit to spider a web site and index it on disk (HTML pages only for now). I can send it to the list if anyone is interested (it's one class, 200 loc). cheers, sv
Re: suitability of lucene for project
It could be part of your solution, but I don't think so. Let me explain: I've done something similar to what you describe a few times. I often use HttpUnit to get information. How you process it is up to you. If you want it to be indexed (searchable), you can use Lucene. If you want to extract structured (or semi-structured) information, use wrapper induction techniques (not Lucene). cheers, sv On 13 Apr 2004, Sebastian Ho wrote: Hi all, I am investigating technologies to use for a project which basically retrieves HTML pages on a regular basis (or whenever there are changes), allows HTML parsing to extract specific information, and presents them as links in a webpage. Note that this is not a general search-engine kind of project; we are extracting clinical information from various websites and consolidating it. Please advise me on whether Lucene can do the above; in areas where it cannot, suggestions for solutions will be appreciated. Thanks Sebastian Ho Bioinformatics Institute
Highlight package
Hello all, The link to Mark Harwood's highlight package is down; anyone have any idea where his package would be available? cheers, sv
Re: Numeric field data
Added to the wiki; it can of course be removed if it's transferred to the FAQs. sv On Sun, 4 Apr 2004, Kevin A. Burton wrote: Stephane James Vaucher wrote: Hi Tate, There is a solution by Erik that pads numbers in the index. That would allow you to search correctly. I'm not sure about decimals, but you could always add a multiplier. Wonder if that should go in the FAQ... wiki...
Re: Numeric field data
Hi Tate, There is a solution by Erik that pads numbers in the index. That would allow you to search correctly. I'm not sure about decimals, but you could always add a multiplier. HTH, sv On Fri, 2 Apr 2004, Tate Avery wrote: Hello, Is there a way (direct or indirect) to support a field with numeric data? More specifically, I would be interested in doing a range search on numeric data and having something like: number:[1 TO 2] ... and not have it return 11 or 103, etc. But return 1.5, for example. Is there any support in current and/or upcoming versions for this type of thing? Or, has anyone figured out a creative workaround to obtain the desired result? Thank you for any comments, Tate p.s. Ideally, I would be able to do equal, greater than, less than, and these in combination with each other (i.e. ranges, greater than or equal to, etc.).
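The padding trick works because Lucene compares terms as text: once every number has the same width, lexicographic order and numeric order coincide, so range queries behave. A minimal sketch of the idea (the class name, the width, and the multiplier of 100 for decimals are all illustrative choices, not anything Lucene prescribes):

```java
// Pad numbers to a fixed width so lexicographic term order matches
// numeric order. A multiplier shifts decimals into integers first
// (e.g. 1.5 -> 150 with multiplier 100, so 1.5 sorts between 1 and 2).
public class NumberPad {
    public static String pad(long value, int width) {
        StringBuffer sb = new StringBuffer(Long.toString(value));
        while (sb.length() < width) {
            sb.insert(0, '0'); // left-pad with zeros
        }
        return sb.toString();
    }

    public static String padDecimal(double value, long multiplier, int width) {
        return pad(Math.round(value * multiplier), width);
    }
}
```

Indexing pad(n, width) as a Keyword field then makes a query like number:[pad(1) TO pad(2)] match only values numerically between 1 and 2.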
RE: Nested category strategy
Another possibility is to add all combinations in a single field. addField("category", "/Science/"); addField("category", "/Science/Medicine"); addField("category", "/Science/Foo"); addField("category", "/Biology"); Your wildcard search should work, and you shouldn't have a problem with a search like /Science/*. HTH, sv On Thu, 1 Apr 2004, Tate Avery wrote: Could you put them all into a tab-delimited string and store that as a single field, then use a TabTokenizer on the field to search? And, if you need to, do a .split("\t") on the field value in order to break them back up into individual categories. -Original Message- From: David Black [mailto:[EMAIL PROTECTED] Sent: Thursday, April 01, 2004 2:49 PM To: [EMAIL PROTECTED] Subject: Nested category strategy Hey All, I'm trying to figure out the best approach to something. Each document I index has an array of categories which looks like the following example: /Science/Medicine/Serology/blood gas /Biology/Fluids/Blood/ etc. Anyway, there are a couple of things I'm trying to deal with. 1. The fact that we have an undefined array size. I can't just shove these into a single field. I could explode them into multiple fields on the fly like category_1, category_2, etc. 2. The fact that a search will need to be performed like category:/Science/Medicine/* and would need to return all items within that category. Thanks in advance to anyone who can give me some help here. Thanks
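The "add all combinations" idea above amounts to indexing every ancestor path of each category value. A small helper can generate those values to feed to addField; the class and method names here are hypothetical, and this variant emits paths without trailing slashes, which differs slightly from the example values in the message above:

```java
import java.util.ArrayList;
import java.util.List;

// Expand one category path into itself plus all its ancestor paths,
// so indexing every element lets category:/Science/Medicine/* match
// at any depth. Sketch only; names are illustrative, not Lucene API.
public class CategoryPaths {
    public static List expand(String path) {
        List prefixes = new ArrayList();
        int from = 1; // skip the leading '/'
        int slash;
        while ((slash = path.indexOf('/', from)) != -1) {
            prefixes.add(path.substring(0, slash)); // ancestor up to this '/'
            from = slash + 1;
        }
        prefixes.add(path); // the full path itself
        return prefixes;
    }
}
```

Each document would then get one category field value per element of the returned list, which also sidesteps the "undefined array size" concern: a multi-valued field grows as needed.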
Re: Performance of hit highlighting and finding term positions for a specific document
I agree with you that a highlight package should be available directly from the Lucene website. To offer this much-desired feature, having a dependency on a personal web site seems a little weird to me. It would also force the community to support this functionality, which would seem appropriate. cheers, sv On Tue, 30 Mar 2004, Kevin A. Burton wrote: I'm playing with this package: http://home.clara.net/markharwood/lucene/highlight.htm Trying to do hit highlighting. This implementation uses another Analyzer to find the positions for the result terms. This seems very inefficient since Lucene already knows the frequency and position of given terms in the index. My question is whether it's hard to find a TermPosition for a given term in a given document rather than the whole index. IndexReader.termPositions( Term term ) is term specific, not term and document specific. Also it seems that after all this time Lucene should have efficient hit highlighting as a standard package. Is there any interest in seeing a contribution in the sandbox for this if it uses the index positions?
Re: Demoting results
Mark, Thanks for the update; since I contributed the page, I was going to modify it (I don't want to force work on others). sv On Mon, 29 Mar 2004 [EMAIL PROTECTED] wrote: Hi Doug, Thanks for the post. BoostingQuery looks to be cleaner, faster and more generally useful than my implementation :-) Unless anyone has a particularly good reason, I'll remove the link to my code that Stephane put on the Wiki contributions page. I definitely find BoostingQuery very useful and would be happy to see it in Lucene core, but I'm not sure it's popular enough to warrant adding special support to the query parser. BTW, I've had a thought about your suggestion for making the highlighter use some form of RAM index of sentence fragments and then querying it to get the best fragments. This is nice in theory but could fail to find anything if the query is of these forms: a AND b a b When the code that breaks a doc into sentence docs splits co-occurring a and b terms into separate docs, this would produce no match. I don't think there's an easy way round that, so I'll stick to the current approach of scoring fragments simply based on terms found in the query. Cheers Mark
Re: Is RangeQuery more efficient than DateFilter?
I've added some information contained in this thread to the wiki. http://wiki.apache.org/jakarta-lucene/DateRangeQueries If you wish to add more information, go right ahead, but since I added this info, I believe it's ultimately my responsibility to maintain it. sv On Mon, 29 Mar 2004, Kevin A. Burton wrote: Erik Hatcher wrote: One more point... caching is done by the IndexReader used for the search, so you will need to keep that instance (i.e. the IndexSearcher) around to benefit from the caching. Great... Damn... looked at the source of CachingWrapperFilter and it makes sense. Thanks for the pointer. The results were pretty amazing. Here are the results before and after. Times are in millis: Before caching the Field: Searching for Jakarta: 2238 1910 1899 1901 1904 1906 After caching the field: 2253 10 6 8 6 6 That's a HUGE difference :) I'm very happy :)
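The timing pattern above (one slow first search, then near-instant repeats) is exactly what per-reader caching produces, and it is also why the IndexSearcher must be kept around: the cache is keyed on the IndexReader, so a fresh reader means a fresh, empty cache. The general pattern can be sketched without Lucene; names below are hypothetical, not Lucene's actual CachingWrapperFilter source:

```java
import java.util.BitSet;
import java.util.WeakHashMap;

// Memoize an expensive per-reader computation, in the spirit of
// CachingWrapperFilter caching a filter's BitSet per IndexReader.
// The WeakHashMap lets an entry be collected once its reader is gone.
public class PerReaderCache {
    private final WeakHashMap cache = new WeakHashMap();
    private int computations = 0; // counts the expensive path, for demonstration

    public BitSet bits(Object reader) {
        BitSet cached = (BitSet) cache.get(reader);
        if (cached != null) {
            return cached; // the fast path behind the 6-10 ms repeat searches
        }
        computations++;
        BitSet fresh = expensiveCompute(reader); // the ~2-second first search
        cache.put(reader, fresh);
        return fresh;
    }

    private BitSet expensiveCompute(Object reader) {
        return new BitSet(); // stand-in for scanning the whole index
    }

    public int computations() {
        return computations;
    }
}
```

Replacing the reader (e.g. reopening the searcher) discards the cached work, which is the trade-off behind keeping a long-lived IndexSearcher.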
Javadocs lucene 1.4
Are the javadocs available on the site? I'd like to see the javadocs for lucene-1.4 (specifically SpanQuery) somewhere on the Lucene website. I've subscribed to the users mailing list, but I've never got a feel for the new version. Is there any way for this to happen, or should I await 1.4-rc1? cheers, sv
Re: Demoting results
Mark, I've added a section in the wiki called http://wiki.apache.org/jakarta-lucene/CommunityContributions and have added an entry for your message. If you want to edit the message, go for it. I believe that the wiki can support attached files if you want to upload there. cheers, sv On Sun, 28 Mar 2004 [EMAIL PROTECTED] wrote: I've found an elegant way of doing this now for all types of search - a new NegatingQuery class that takes any Query object in its constructor and selects all documents that DON'T match, giving them a user-definable boost. The code is here: http://www.inperspective.com/lucene/demote.zip Cheers Mark
Re: Lucene 1.4 - lobby for final release
I'm personally a fan of a "release small but often" approach, but what are the new features available in 1.4 (a list would be nice, on the wiki perhaps)? Will there be interim builds available to try these new features out soon? There seem to be no nightly builds on: http://cvs.apache.org/builds/jakarta-lucene/nightly/ cheers, sv On Fri, 26 Mar 2004, Chad Small wrote: Thanks Erik. OK, this is my official lobbying effort for the release of 1.4 to final status. Anyone else need/want a 1.4 release? Does anyone have any information on 1.4 release plans? thanks, chad. -Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: Fri 3/26/2004 1:25 PM To: Lucene Users List Cc: Subject: Re: too many files open error On Mar 26, 2004, at 1:33 PM, Chad Small wrote: Is this :) serious? This is open-source. I'm only as serious as it would take for someone to push it through. I don't know what the timeline is, although lots of new features are available. Because we have a need/interest in the new field sorting capabilities and QueryParser keyword handling of dashes (-) that would be in 1.4, I believe. It's so much easier to explain that we'll use a final release of Lucene instead of a dev build of Lucene. Why explain it?! Just show great results and let that be the explanation :) If so, what would an expected release date be? *shrug* - feel free to lobby for it. I don't know what else is planned before a release. Erik
Re: Lucene 1.4 - lobby for final release
I hope nobody minds, but I've added a link on the wiki to the head of CHANGES.txt. I'm not sure if anyone is maintaining the wiki; if not, I can take a look at it. I could maybe rearrange things to look like a sample site: http://wiki.apache.org/avalon Any comments? I'll probably just go ahead and do it and await criticism ;) cheers, sv On Fri, 26 Mar 2004, Erik Hatcher wrote: On Mar 26, 2004, at 3:32 PM, Stephane James Vaucher wrote: I'm personally a fan of a "release small but often" approach, but what are the new features available in 1.4 (a list would be nice, on the wiki perhaps)? Will there be interim builds available to try these new features out soon? There is a CHANGES.txt in the root of the jakarta-lucene CVS repository that stays pretty much current and accurate. I'm pasting it below for the 1.3 - CVS HEAD changes. There seem to be no nightly builds on: http://cvs.apache.org/builds/jakarta-lucene/nightly/ I guess at this time you will have to build it yourself from CVS. There is one show-stopper before we can release an RC1. We must fully convert to ASL 2.0 (meaning every single source file needs the license header, as well as any other files that can be tagged with it). I know Otis has changed some files, but we need a full sweep. There have been some utilities posted in a committers area to facilitate this change more automatically if we want to use them. Erik excerpt from CHANGES.txt 1.4 RC1 1. Changed the format of the .tis file, so that: - it has a format version number, which makes it easier to back-compatibly change file formats in the future. - the term count is now stored as a long. This was the one aspect of Lucene's file formats which limited index size. - a few internal index parameters are now stored in the index, so that they can (in theory) now be changed from index to index, although there is not yet an API to do so. These changes are back compatible. The new code can read old indexes.
But old code will not be able to read new indexes. (cutting) 2. Added an optimized implementation of TermDocs.skipTo(). A skip table is now stored for each term in the .frq file. This only adds a percent or two to overall index size, but can substantially speed up many searches. (cutting) 3. Restructured the Scorer API and all Scorer implementations to take advantage of an optimized TermDocs.skipTo() implementation. In particular, PhraseQuerys and conjunctive BooleanQuerys are faster when one clause has substantially fewer matches than the others. (A conjunctive BooleanQuery is a BooleanQuery where all clauses are required.) (cutting) 4. Added new class ParallelMultiSearcher. Combined with RemoteSearchable this makes it easy to implement distributed search systems. (Jean-Francois Halleux via cutting) 5. Added support for hit sorting. Results may now be sorted by any indexed field. For details see the javadoc for Searcher#search(Query, Sort). (Tim Jones via Cutting) 6. Changed FSDirectory to auto-create a full directory tree that it needs by using mkdirs() instead of mkdir(). (Mladen Turk via Otis) 7. Added a new span-based query API. This implements, among other things, nested phrases. See javadocs for details. (Doug Cutting) 8. Added new method Query.getSimilarity(Searcher), and changed scorers to use it. This permits one to subclass a Query class so that it can specify its own Similarity implementation, perhaps one that delegates through that of the Searcher. (Julien Nioche via Cutting) 9. Added MultiReader, an IndexReader that combines multiple other IndexReaders. (Cutting) 10. Added support for term vectors. See Field#isTermVectorStored(). (Grant Ingersoll, Cutting & Dmitry) 11. Fixed the old bug with escaping of special characters in query strings: http://issues.apache.org/bugzilla/show_bug.cgi?id=24665 (Jean-Francois Halleux via Otis) 12.
Added support for overriding default values for the following, using system properties: - default commit lock timeout - default maxFieldLength - default maxMergeDocs - default mergeFactor - default minMergeDocs - default write lock timeout (Otis) 13. Changed QueryParser.jj to allow '-' and '+' within tokens: http://issues.apache.org/bugzilla/show_bug.cgi?id=27491 (Morus Walter via Otis) 14. Changed so that the compound index format is used by default. This makes indexing a bit slower, but vastly reduces the chances of file handle problems. (Cutting)
Re: Documentation and presentations
Erik, maybe Otis and yourself should slow down on development. You wouldn't want your book to discuss lucene-1.3 if you release a version 1.5 before it hits the stores... unless that's your master plan ;) sv On Fri, 26 Mar 2004, Erik Hatcher wrote: So far so good, Stephane, on the wiki changes - looks good! As for our book - at this point, early summer seems like when it'll actually be on the shelves. By the end of April we should have mostly everything complete, reviewed, and entirely in the publisher's hands. *ugh* - this process takes much longer than even exaggerated estimates. Erik On Mar 26, 2004, at 6:00 PM, Stephane James Vaucher wrote: Hello lucene community, I'll be presenting Lucene at the GUJM (Java Users Group of Montreal) in mid-April; could you send me references, articles, and presentations not readily available on the Lucene site (at http://jakarta.apache.org/lucene/docs/resources.html)? Otis or Erik, I'll mention that you have written a book on Lucene. When will it be out? I'll also see if I can rearrange the wiki using the information you send me, and I'll contribute my presentation (in French). cheers, sv
Wiki and news
On the wiki, I've looked up some references for Lucene community releases to put under News (http://wiki.apache.org/jakarta-lucene/LatestNews); if I've missed some, you can modify the page yourself (it's a wiki after all). sv
Re: DocumentWriter, StopFilter should use HashMap... (patch)
Just found the rest of the thread. I'll shut up now ;) sv On Sun, 14 Mar 2004, Stephane James Vaucher wrote: Back from a week's vacation, so this reply is a little late, maybe out of order as well ;). Comment inline: On Tue, 9 Mar 2004, Kevin A. Burton wrote: Doug Cutting wrote: Erik Hatcher wrote: Well, one issue you didn't consider is changing a public method signature. I will make this change, but leave the Hashtable signature method there. I suppose we could change the signature to use a Map instead, but I believe there are some issues with doing something like this if you do not recompile your own source code against a new Lucene JAR, so I will simply provide another signature too. This would no longer compile with the change Kevin proposes. To make things back-compatible we must: 1. Keep but deprecate the StopFilter(Hashtable) constructor; 2. Keep but deprecate StopFilter.makeStopTable(String[]); 3. Add a new constructor: StopFilter(HashMap); 4. Add a new method: StopFilter.makeStopMap(String[]); Why impose implementation details in the constructor? Shouldn't the constructor use a Map (not a HashMap), a Set, or a String array? sv Does that make sense? This patch and attachment take care of this problem... It does make this class more complex than it needs to be... but 1/2 of the methods are deprecated. Kevin
Re: Storing numbers
Weird idea: how about transforming your long into a Date and using a DateFilter to do a range query? sv On Fri, 5 Mar 2004, Erik Hatcher wrote: Terms in Lucene are text. If you want to deal with number ranges, you need to pad them. 0001 for example. Be sure all numbers have the same width and are zero padded. Lucene uses lexicographical ordering, so you must be sure things collate in this way. Erik On Mar 5, 2004, at 11:46 AM, [EMAIL PROTECTED] wrote: On Friday 05 March 2004 15:42, Otis Gospodnetic wrote: Try with Field.Keyword. Ok, works. Another problem: Range searches don't work. id:(1 TO 1069421083284) does return only 1 hit - 1069421083284.
Re: Storing numbers
On Fri, 5 Mar 2004 [EMAIL PROTECTED] wrote: On Friday 05 March 2004 18:01, Erik Hatcher wrote: 0001 for example. Be sure all numbers have the same width and zero padded. And what about a range like 100 TO 1000? You mean 0100 To 1000 or 100 to 0001000 ;) sv
Sys properties Was: java.io.tmpdir as lock dir .... once again
As I've stated in my earlier mail, I like this change. More importantly, could this become a standard way of changing configurations at runtime? For example, the default merge factor could also be set in this manner. sv On Wed, 3 Mar 2004, Michael Duval wrote: I agree with both the property name change and also making it static. Mike Doug Cutting wrote: Michael Duval wrote: I've hacked the code for the time being by updating FSDirectory and replacing all System.getProperty("java.io.tmpdir") calls with a call to a new method getLockDir(). This method checks for a "lucene.lockdir" prop before the "java.io.tmpdir" prop, giving the end user a bit more flexibility in where locks are stored. In general, I support this change. Here is the method:
/** Allow flexible locking directories - Michael R. Duval 3/02/04 */
private String getLockDir() {
    String lockDir;
    if ((lockDir = System.getProperty("lucene.lockdir")) == null)
        return System.getProperty("java.io.tmpdir");
    else
        return lockDir;
}
In particular, I have some quibbles. The property should be named something like "org.apache.lucene.lockdir", not just "lucene.lockdir". And there's no reason to look it up each time: it can just be a static.
private static final String LOCK_DIR =
    System.getProperty("org.apache.lucene.lockdir",
                       System.getProperty("java.io.tmpdir"));
Doug
Re: Sys properties Was: java.io.tmpdir as lock dir .... once again
How about (looking big rather than small):
- MaxClause from BooleanQuery (I know there have been discussions on the dev list, but I haven't been following them)
- default commit_lock_name
- default commit_lock_timeout
- default maxFieldLength
- default maxMergeDocs
- default mergeFactor
- default minMergeDocs
- default write_lock_name
- default write_lock_timeout

I'm currently configuring parts of my app using sys properties, particularly the mergeFactor, because my prod system has 2GB of RAM and is Windows-based and my dev machine has 256MB and is Linux. If no one takes a crack at this, I'll see what I can do in 2 weeks, after my vacation. Cheers, sv

On Wed, 3 Mar 2004, Doug Cutting wrote: Stephane James Vaucher wrote: As I've stated in my earlier mail, I like this change. More importantly, could this become a standard way of changing configurations at runtime? For example, the default merge factor could also be set in this manner. Sure, that's reasonable, so this would be something like:

private static final int DEFAULT_MERGE_FACTOR =
    Integer.parseInt(System.getProperty("org.apache.lucene.mergeFactor", "10"));

in IndexWriter.java. What other candidates are there for this treatment? Doug
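For a numeric setting like the merge factor, the same idiom needs a parse step, as in Doug's snippet. A minimal sketch (the org.apache.lucene.mergeFactor property name is Doug's proposal, not an existing Lucene property at this point):

```java
public class IndexConfig {
    // Read the merge factor from a system property, defaulting to "10",
    // which matches Lucene's built-in mergeFactor default. Note the
    // default must be passed as a String before parsing.
    static int mergeFactor() {
        return Integer.parseInt(
                System.getProperty("org.apache.lucene.mergeFactor", "10"));
    }

    public static void main(String[] args) {
        System.out.println("mergeFactor = " + mergeFactor());
    }
}
```

This is what makes the dev-box-vs-prod-box scenario above work: the same code runs everywhere, and each deployment passes -Dorg.apache.lucene.mergeFactor=... on its own command line.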
Re: java.io.tmpdir as lock dir .... once again
I've done something similar to configure my merge factor (but it was outside my code), and am planning on setting the limit on boolean queries this way as well. I think it's pretty clean, especially if you use org.apache.lucene.xxx properties with decent default values. Adding this feature should probably come with better documentation of the hazards of using an index lock in a distributed system, considering many people like to know the implications of running lucene in a web (and potentially replicated) env. my 2c, sv

On Tue, 2 Mar 2004, Otis Gospodnetic wrote: This looks nice. However, what happens if you have two Java processes that work on the same index and give it different lock directories? They'll mess up the index. If you sell people coffee, they can always burn themselves. Might as well warn them. Should we try to prevent this by not offering this option, or should we offer it, document it well, and leave it up to the user to play by the rules or not? I'm leaning towards the latter, but I think some Lucene developers would be more conservative. Otis

--- Michael Duval [EMAIL PROTECTED] wrote: Hello All, I've come across my first gotcha with the system property java.io.tmpdir as the lock directory. Over here at APS we run lucene in two different servlet containers on two different servers for both performance and security reasons. One container gives read access to the collection and the other is constantly updating the collection. The collection is NFS mounted from both servers. This worked fine until the lucene 1.3 update. Now the lock files are being written to the temp dirs in each of the respective containers' root dirs. This of course breaks the locking scheme. I could have changed the tmpdir prop to point at the collection directory, but this would also pollute it with other, non-related temp files.
My solution was as follows: I've hacked the code for the time being by updating FSDirectory and replaced all System.getProperty("java.io.tmpdir") calls with a call to a new method getLockDir(). This method checks for a lucene.lockdir prop before the java.io.tmpdir prop, giving the end user a bit more flexibility in where locks are stored. Here is the method:

/** Allow flexible locking directories - Michael R. Duval 3/02/04 */
private String getLockDir() {
    String lockDir;
    if ((lockDir = System.getProperty("lucene.lockdir")) == null)
        return System.getProperty("java.io.tmpdir");
    else
        return lockDir;
}

Hopefully a solution similar to this will make it into one of the next distributions. Thanks and Cheers, Mike -- Michael R. Duval [EMAIL PROTECTED] E-Journal Programmer/Analyst The American Physical Society 1 Research Road Ridge, NY 11961 www.aps.org 631 591 4127
Field boosting Was: Indexing multiple instances of the same field for each document
Slightly off topic to this thread, but how would adding different fields with the same name deal with boosts? I've looked at the javadoc and FAQ, but I think it's not a common use of this feature. Any insight? E.g.

Document doc = new Document();
Field f1 = Field.Keyword("fieldName", "foo");
f1.setBoost(1);
doc.add(f1);
Field f2 = Field.Keyword("fieldName", "bar");
f2.setBoost(2);
doc.add(f2);

Cheers, sv

On Fri, 27 Feb 2004, Doug Cutting wrote: I think it's Document.add(). Fields are pushed onto the front, rather than added to the end. Doug

Roy Klein wrote: I think it's got something to do with Document.invertDocument(). When I reverse the words in the phrase, the other document matches the phrase query. Roy

-Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: Friday, February 27, 2004 4:34 PM To: Lucene Users List Subject: Re: Indexing multiple instances of the same field for each document On Feb 27, 2004, at 4:10 PM, Roy Klein wrote: Hi Erik, While you might be right in this example (using Field.Keyword), I can see how this would still be a problem in other cases. For instance, if I were adding more than one word at a time in the example I attached. I concur that it appears to be a bug. It is unlikely folks use Lucene like this too much, though - there probably are not too many scenarios where combining things into a single String or Reader is a burden. I'm interested to know where in the code this oddity occurs so I can understand it more. I did a brief bit of troubleshooting but haven't figured it out yet. Something in DocumentWriter, I presume. Erik Roy -Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: Friday, February 27, 2004 2:12 PM To: Lucene Users List Subject: Re: Indexing multiple instances of the same field for each document Roy, On Feb 27, 2004, at 12:12 PM, Roy Klein wrote: Document doc = new Document(); doc.add(Field.Text("contents", "the")); Changing these to Field.Keyword gets it to work.
I'm delving a little bit to understand why, but it seems if you are adding words individually anyway you'd want them to be untokenized, right? Erik

doc.add(Field.Text("contents", "quick"));
doc.add(Field.Text("contents", "brown"));
doc.add(Field.Text("contents", "fox"));
doc.add(Field.Text("contents", "jumped"));
doc.add(Field.Text("contents", "over"));
doc.add(Field.Text("contents", "the"));
doc.add(Field.Text("contents", "lazy"));
doc.add(Field.Text("contents", "dogs"));
doc.add(Field.Keyword("docnumber", "1"));
writer.addDocument(doc);
doc = new Document();
doc.add(Field.Text("contents", "the quick brown fox jumped over the lazy dogs"));
doc.add(Field.Keyword("docnumber", "2"));
writer.addDocument(doc);
writer.close();
}

public static void query(File indexDir) throws IOException {
    Query query = null;
    PhraseQuery pquery = new PhraseQuery();
    Hits hits = null;
    try {
        query = QueryParser.parse("quick brown", "contents", new StandardAnalyzer());
    } catch (Exception qe) { System.out.println(qe.toString()); }
    if (query == null) return;
    System.out.println("Query: " + query.toString());
    IndexReader reader = IndexReader.open(indexDir);
    IndexSearcher searcher = new IndexSearcher(reader);
    hits = searcher.search(query);
    System.out.println("Hits: " + hits.length());
    for (int i = 0; i < hits.length(); i++) {
        System.out.println(hits.doc(i).get("docnumber") + " ");
    }
    pquery.add(new Term("contents", "quick"));
    pquery.add(new Term("contents", "brown"));
    System.out.println("PQuery: " + pquery.toString());
    hits = searcher.search(pquery);
    System.out.println("Phrase Hits: " + hits.length());
    for (int i = 0; i < hits.length(); i++) {
        System.out.println(hits.doc(i).get("docnumber") + " ");
    }
    searcher.close();
    reader.close();
}

public static void main(String[] args) throws Exception {
    if (args.length != 1) {
        throw new Exception("Usage: " + test.class.getName() + " index dir");
    }
    File indexDir = new File(args[0]);
    test(indexDir);
    query(indexDir);
}
}

--- My results:
Query: contents:quick contents:brown
Hits: 2
1
2
PQuery: contents:"quick brown"
Phrase Hits: 1
2
Re: Field boosting Was: Indexing multiple instances of the same field for each document
Cheers, I index information in chunks. The reason for this is that I have an IR tool that returns information ordered by confidence rather than by the fields I index. I just add fields as they come, but I would be interested in knowing how other people deal with confidence. Following your answer, I can't add my confidence as boosts to the terms as I index them; do you have any suggestions? I'm guessing that I'll probably have to add multiple copies of my fields to simulate boosting. sv

On Fri, 27 Feb 2004, Erik Hatcher wrote: On Feb 27, 2004, at 6:26 PM, Stephane James Vaucher wrote: Slightly off topic to this thread, but how would adding different fields with the same name deal with boosts? I've looked at the javadoc and FAQ, but I think it's not a common use of this feature. Any insight? There is only one boost per field name. However, the effect is the multiplication of them all, interestingly. So, in your example below, the boost of the fieldName is 2. Erik E.g.

Document doc = new Document();
Field f1 = Field.Keyword("fieldName", "foo");
f1.setBoost(1);
doc.add(f1);
Field f2 = Field.Keyword("fieldName", "bar");
f2.setBoost(2);
doc.add(f2);

Cheers, sv
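Erik's answer (that the boosts of same-named fields multiply together) makes the effective boost for sv's example easy to compute. A small sketch of that arithmetic, in plain Java with no Lucene types, just to illustrate the rule he describes:

```java
public class BoostMath {
    // Effective boost for a field name when several fields share it:
    // per Erik's reply, Lucene multiplies the individual boosts together.
    static float effectiveBoost(float... boosts) {
        float product = 1.0f;
        for (float b : boosts) {
            product *= b;
        }
        return product;
    }

    public static void main(String[] args) {
        // sv's example: f1 has boost 1, f2 has boost 2,
        // so the field's effective boost is 1 * 2 = 2.
        System.out.println(effectiveBoost(1.0f, 2.0f));
    }
}
```

The multiplicative rule is why per-chunk confidence can't be expressed this way: adding a chunk with boost 0.5 would halve the boost of every other same-named field, rather than weighting just that chunk.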
Re: Incrementally updating and monitoring the index
On Fri, 13 Feb 2004 [EMAIL PROTECTED] wrote: Hi! Can Lucene incrementally update its index (i.e. comparing against a list of docs and removing those that are no longer found)? Incremental updates (additions and deletions) are possible, but I'm not sure if I understand your question. Lucene holds its own instances of documents structured in text fields (not going into details here). These Lucene documents are created and updated programmatically, not automatically, because Lucene does not keep tabs on external documents. I'd like to monitor the index for certain queries/terms, i.e. I want to be notified if there are (new) hits for a list of terms each time after I add a document to the index - continuously. Is this possible? The index will contain several hundreds of thousands of documents and will be frequently accessed concurrently. Very possible: before adding a document, you can check (with the judicious use of an id) if it has already been added. If it hasn't, do your notification, but this requires programming. For concurrent write access, there is a lock, so you might want to use a singleton responsible for adding documents. TIA Timo HTH, sv
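The check-before-add idea above (using a document id to detect new documents, then notifying) can be sketched without any Lucene API. The class and method names here are hypothetical, and a real implementation would query the index for the id rather than keep an in-memory set:

```java
import java.util.HashSet;
import java.util.Set;

public class NewDocumentDetector {
    // Ids of documents already added to the index. In practice this
    // check would be done against the index itself (e.g. by searching
    // on a keyword id field), not an in-memory set.
    private final Set<String> indexedIds = new HashSet<>();

    // Returns true if the id has not been seen before, i.e. the caller
    // should index the document and fire its term notifications.
    public boolean isNew(String docId) {
        return indexedIds.add(docId);
    }
}
```

Routing all additions through one such object is also where the singleton suggestion fits: a single writer sidesteps contention on Lucene's write lock.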
Re: The First Parameter of the IndexWriter
You should probably take a look at the javadoc: http://jakarta.apache.org/lucene/docs/api/index.html As for where to store the index, you'll want to put it somewhere all potential users can access it, and where there is enough space for your index. In a nutshell, you need to think of:
- amount of storage required
- permissions (e.g. if you need to access it from an app server with security restrictions)
- access, on a shared HD or not
- deployment; if for a product, then it should be included in your installation strategy, so you might use c:/Program Files/.../MyApp/index, or /usr/local/MyApp/index.

On win, I personally use my D drive, in a path corresponding to d:/app-name/index. HTH, sv

On Tue, 10 Feb 2004, Caroline Jen wrote: I am constructing a web site. I am learning Lucene so that I can use it to search the database. I started with reading the introduction in "Text Indexing with Jakarta Apache Lucene" at http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html and in the example given, it looks like I have to specify a directory for the first parameter of the IndexWriter (see below).

String indexDir = System.getProperty("java.io.tmpdir", "tmp") + System.getProperty("file.separator") + "index-1";
Analyzer analyzer = new StandardAnalyzer();
boolean createFlag = true;
IndexWriter writer = new IndexWriter(indexDir, analyzer, createFlag);

I have a record created and stored in a table in my database whenever a user submits his/her inputs, and I want to index that record. What should be the indexDir in my case? Should I follow the above example and use java.io.tmpdir? I sort of doubt it. Please advise.
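Following sv's advice to avoid java.io.tmpdir, composing an application-owned index path might look like the following sketch; "MyApp" and the base directory are placeholder names, not anything Lucene prescribes:

```java
import java.io.File;

public class IndexLocation {
    // Compose an index directory under an application-owned base
    // directory instead of the shared temp directory, using the
    // platform's path separator.
    static String indexDir(String baseDir, String appName) {
        return baseDir + File.separator + appName + File.separator + "index";
    }

    public static void main(String[] args) {
        // e.g. /usr/local/MyApp/index on Unix, d:\MyApp\index on Windows
        System.out.println(indexDir("/usr/local", "MyApp"));
    }
}
```

The resulting string can be passed directly as the first parameter of the IndexWriter constructor shown in Caroline's snippet.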