Re: Query#rewrite Question
On Nov 10, 2004, at 9:51 PM, Satoshi Hasegawa wrote:

> Our program accepts input in the form of Lucene query syntax from the user, but we wish to perform additional tasks such as thesaurus expansion. So I want to manipulate the Query object that results from parsing.

You may want to consider using an Analyzer to expand queries rather than manipulating the query object itself.

> My question is: is the result of the Query#rewrite method guaranteed to be either a TermQuery, a PhraseQuery, or a BooleanQuery, and if it is a BooleanQuery, do all the constituent clauses also reduce to one of the above three classes?

No. For example, look at the SpanQuery family. These do no explicit rewriting and thus are left as themselves.

> If not, what if the original Query object was the one obtained from the QueryParser#parse method? Can I assume the above in this restricted case? I experimented with the current version, and the above seems to hold; I'm asking whether this could change in the future. Thank you.

I think we'll see QueryParser, or at least more sophisticated versions of it, emerge that support SpanQuerys. In fact, in our book, I created a subclass of QueryParser that overrides getFieldQuery and returns a SpanNearQuery in order to achieve ordered phrase searching.

Erik

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
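A minimal sketch of the subclass Erik describes: override getFieldQuery so multi-word field text becomes an ordered SpanNearQuery. This is illustrative only — the exact getFieldQuery signature differs across Lucene versions, and a real implementation must run the text through the parser's Analyzer instead of splitting on whitespace.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class SpanNearQueryParser extends QueryParser {

    public SpanNearQueryParser(String field, Analyzer analyzer) {
        super(field, analyzer);
    }

    // Signature assumed from the Lucene 1.4-era QueryParser; check your version.
    protected Query getFieldQuery(String field, Analyzer analyzer, String queryText)
            throws ParseException {
        // Naive tokenization for the sketch; real code should use the Analyzer.
        String[] words = queryText.split("\\s+");
        if (words.length == 1) {
            return new TermQuery(new Term(field, words[0]));
        }
        SpanQuery[] clauses = new SpanQuery[words.length];
        for (int i = 0; i < words.length; i++) {
            clauses[i] = new SpanTermQuery(new Term(field, words[i]));
        }
        // slop 0, inOrder true: terms must appear adjacent and in order,
        // which is what gives the "ordered phrase searching" behaviour.
        return new SpanNearQuery(clauses, 0, true);
    }
}
```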
Re: Locking issue
On Nov 11, 2004, at 1:47 AM, [EMAIL PROTECTED] wrote:

> Yes, I tried that too and it worked. The issue is that our Operations folks plan to install this on a pretty busy box, and I was hoping that Lucene wouldn't cause issues if it only had a small slice of the CPU. I don't think that Lucene is causing the issue. I'd like to wait and see if others have opinions/suggestions on this issue.

Again, what your example program is doing is unrealistic - you're hammering the filesystem and CPU with infinite loops that never sleep. If a minimal sleep works, then I don't think you'll have to concern the operations folks with a bigger box.

Erik
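The fix Erik suggests can be a single line: sleep between polling iterations instead of spinning. A self-contained toy version (the loop body is a stand-in for "re-open the reader and search again"):

```java
public class PollSketch {
    public static void main(String[] args) throws InterruptedException {
        int checks = 0;
        boolean upToDate = false;
        while (!upToDate) {
            checks++;                  // stand-in for re-checking the index
            upToDate = (checks >= 3);  // pretend the third check succeeds
            Thread.sleep(100);         // even a small sleep stops hammering CPU and disk
        }
        System.out.println("checks=" + checks);
    }
}
```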
Bug in the BooleanQuery optimizer? ..TooManyClauses
Hi!

First of all, I've read about BooleanQuery$TooManyClauses, so I know that it has a 1024-clause limit by default, which is good enough for me, but I still think it works strangely.

Example: I have an index with about 20 million documents. Let's say there are about 3000 variants in the entire document set matching this word mask: cab*. Let's say about 500 documents contain the word: spectrum. Now, when I search for cab* AND spectrum, I don't expect it to throw an exception. It should first restrict the search to the 500 documents containing the word spectrum, then collect the variants of cab* within those documents, which turns out to be two or three variants (cable, cables, maybe some more), and the search should return, let's say, 10 documents.

Similar example: when I search for cab* AND nonexistingword, it still throws a TooManyClauses exception instead of saying "No results", since there is no nonexistingword in my document set, so it doesn't even have to start collecting the variations of cab*.

Is there any patch for this issue? Thank you for your time!

Sanyi (I'm using Lucene 1.4.2)

p.s.: Sorry for re-sending this message; I first sent it as an accidental reply to a wrong thread.
Re[2]: Faster highlighting with TermPositionVectors (update)
Hello Mark. I'm just wondering about the following piece of code from your latest TokenSources class:

public static TokenStream getAnyTokenStream(IndexReader reader, int docId,
                                            String field, Analyzer analyzer)
    throws IOException
{
  TokenStream ts = null;
  TermFreqVector tfv = (TermFreqVector) reader.getTermFreqVector(docId, field);
  if (tfv != null) {
    if (tfv instanceof TermPositionVector) {
      // read pre-parsed token position info stored on disk
      TermPositionVector tpv = (TermPositionVector) reader.getTermFreqVector(docId, field);
      ts = getTokenStream(tpv);
    }
  }
  // No token info stored, so fall back to analyzing raw content
  if (ts == null) {
    ts = getTokenStream(reader, docId, field, analyzer);
  }
  return ts;
}

Aren't you calling getTermFreqVector(docId, field) twice? Why not just call:

if (tfv instanceof TermPositionVector) {
  ts = getTokenStream((TermPositionVector) tfv);
}

Max

Friday, November 5, 2004, 12:25:13 AM, you wrote:

> Having revisited the original TokenSources code, it looks like one of the optimisations I put in will fail if fields are stored with non-contiguous position info (i.e. the analyzer has messed with token position numbers so they overlap or have gaps like ..3,3,7,8,9,..). I've now made the TokenSources code safe by default by assuming token position values are not contiguous and should not be used for sorting. For those who know what they are doing, I have added a parameter to one of the methods to turn the optimisation back on if they can guarantee positions are contiguous.
>
> New code is at the same place: http://www.inperspective.com/lucene/TokenSources.java
>
> Cheers
> Mark
Re[2]: Faster highlighting with TermPositionVectors (update)
Thanks, Max. Another schoolboy error in TokenSources.java :) More haste, less speed required on my part. I have updated my code and will post it to the website tonight. This change doesn't appear to make a noticeable difference in performance, but the code is cleaner.

Cheers
Mark
Re: Search scalability
If you load it explicitly, then all 800 MB will make it into RAM. It's easy to try; the API for this is super simple.

Otis

--- [EMAIL PROTECTED] wrote:

> Does it take 800MB of RAM to load that index into a RAMDirectory? Or are only some of the files loaded into RAM?
>
> --- Otis Gospodnetic [EMAIL PROTECTED] wrote:
>
>> Hello, 100 parallel searches going against a single index on a single disk means a lot of disk seeks, all happening at once. One simple way of working around this is to load your FSDirectory into a RAMDirectory. This should be faster (could you report your observations/comparisons?). You can also try using ramfs if you are using Linux.
>> Otis
>>
>> --- Ravi [EMAIL PROTECTED] wrote:
>>
>>> We have one large index for a document repository of 800,000 documents. The size of the index is 800MB. When we do searches against the index, it takes 300-500ms for a single search. We wanted to test the scalability and tried 100 parallel searches against the index with the same query; the average response time was 13 seconds. We used a simple IndexSearcher, and the same searcher object was shared by all the searches. I'm sure people have had success configuring Lucene for better scalability. Can somebody share their approach?
>>> Thanks
>>> Ravi.
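Otis's suggestion in code form — a sketch assuming the Lucene 1.4 API, with a hypothetical index path; the RAMDirectory constructor copies every index file into heap memory, so the JVM needs roughly the on-disk index size in extra heap:

```java
import java.io.IOException;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class RamSearch {
    // indexPath is illustrative, e.g. "/data/lucene/index".
    public static IndexSearcher openInRam(String indexPath) throws IOException {
        FSDirectory onDisk = FSDirectory.getDirectory(indexPath, false); // false = don't create
        RAMDirectory inRam = new RAMDirectory(onDisk); // copies all index files into RAM
        // Share this one searcher across all query threads; searches now
        // hit memory instead of seeking on disk.
        return new IndexSearcher(inRam);
    }
}
```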
RE: Search scalability
Thanks a lot. I'll use RAMDirectory and post my results.

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Thursday, November 11, 2004 9:09 AM
To: Lucene Users List
Subject: Re: Search scalability

> If you load it explicitly, then all 800 MB will make it into RAM. It's easy to try; the API for this is super simple.
> Otis
Re: Academic Question About Indexing
40 million! Wow. OK, this is the kind of answer I was looking for. The site I am working on indexes maybe 1000 documents at any given time. I think I am OK with a single index. Thanks.

----- Original Message -----
From: Will Allen [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, November 10, 2004 7:23 PM
Subject: RE: Academic Question About Indexing

> I have an application that I run monthly that indexes 40 million documents into 6 indexes, then uses a MultiSearcher. The advantage for me is that I can have multiple writers, each indexing 1/6 of the total data, reducing the time it takes to index by about 5X.
>
> Luke Shannon wrote:
>
>> Don't worry; regardless of what I learn in this forum, I am telling my company to get me a copy of that bad boy when it comes out (which, as far as I am concerned, can't be soon enough). I will pay for grama's myself. I think I have reviewed the code you are referring to and have something similar working in my own indexer (using the uid). All is well. My stupid question for the day is: why would you ever want multiple indexes running if you can build one smart indexer that does everything as efficiently as possible? Does the answer to this question move me into multi-threaded indexing territory?
>> Thanks, Luke
>>
>> Otis Gospodnetic wrote:
>>
>>> Uh, I hate to market it, but it's in the book. But you don't have to wait for it, as there is already a Lucene demo that does what you described. I am not sure if the demo always recreates the index or whether it deletes and re-adds only the new and modified files, but if it's the former, you would only need to modify the demo a little to check the timestamps of the File objects and compare them to those stored in the index (if they are being stored - if not, you should add a field to hold that data).
>>> Otis
>>>
>>> --- Luke Shannon [EMAIL PROTECTED] wrote:
>>>
>>>> I am working on debugging an existing Lucene implementation. Before I started, I built a demo to understand Lucene. In my demo I indexed the entire content hierarchy all at once, then optimized the index and used it for queries. It was time consuming but very simple. The code I am currently trying to fix indexes the content hierarchy by folder, creating a separate index for each one. Thus it ends up with a bunch of indexes. I still don't understand how this works (I am assuming they get merged somewhere that I haven't tracked down yet), but I have noticed it doesn't always index the right folder. This results in users reporting inconsistent search behavior after they change a document. To keep things simple, I would like to remove all the logic that figures out which folder to index and just index them all (usually fewer than 1000 files), so I end up with one index. Would indexing time be the only area where I would lose out, or is there more to the approach of creating multiple indexes and merging them? What is a good approach I can take to indexing a content hierarchy composed primarily of pdf, xsl, doc and xml, where any of these documents can be changed several times a day?
>>>> Thanks, Luke
Re: Academic Question About Indexing
Could I ask how fast the search goes against this index, both for simple words and for more advanced phrase and boolean searches? And is there something smart you have done to make this go fast, either in the infrastructure or in the system itself?

Best regards,
Gard Arneson Haugen
Email: [EMAIL PROTECTED] Mobile: +47 93 05 01 91 Fax: +47 21 95 51 99
Magenta News AS - Møllergata 8, 0179 Oslo

Will Allen wrote:

> I have an application that I run monthly that indexes 40 million documents into 6 indexes, then uses a MultiSearcher. The advantage for me is that I can have multiple writers, each indexing 1/6 of the total data, reducing the time it takes to index by about 5X.
RE: Bug in the BooleanQuery optimizer? ..TooManyClauses
Any wildcard search will automatically expand your query to the number of terms found in the index that match the wildcard. For example, wild* would become wild OR wilderness OR wildman, etc., for each matching term that exists in your index. It is because of this that you quickly reach the 1024-clause limit. I automatically set it to max int with the following line:

BooleanQuery.setMaxClauseCount( Integer.MAX_VALUE );

-Original Message-
From: Sanyi [mailto:[EMAIL PROTECTED]
Sent: Thursday, November 11, 2004 6:46 AM
To: [EMAIL PROTECTED]
Subject: Bug in the BooleanQuery optimizer? ..TooManyClauses

> First of all, I've read about BooleanQuery$TooManyClauses, so I know that it has a 1024-clause limit by default, which is good enough for me, but I still think it works strangely. Now, when I search for cab* AND spectrum, I don't expect it to throw an exception. Is there any patch for this issue?
> Sanyi (I'm using Lucene 1.4.2)
HTMLParser.getReader returning null
Hello;

Things were working fine. I have been re-organizing my code to drop into QA when I noticed I was no longer getting search results for my HTML files. When I checked things out, I confirmed I was still creating the Documents, but realized no content was being indexed.

HTMLParser parser = new HTMLParser(f);
// Add the tag-stripped contents as a Reader-valued Text field so it will
// get tokenized and indexed.
doc.add(Field.Text("contents", parser.getReader()));
System.out.println("The content is " + doc.get("contents"));

The SOP line above outputs a null where the contents used to be. Anyone seen this before?

Thanks, Luke
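An aside on the null in the thread above: a Field built from a Reader (as Field.Text(String, Reader) does) is tokenized and indexed but not stored, so Document#get for that field returns null by design — the null need not mean nothing was indexed. A sketch of verifying the content actually made it into the index by searching instead (the index path and sample word are hypothetical):

```java
import java.io.IOException;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class CheckContents {
    // word: a lowercase term known to occur in one of the indexed HTML files.
    public static boolean wasIndexed(String indexPath, String word) throws IOException {
        IndexSearcher searcher = new IndexSearcher(indexPath);
        // doc.get("contents") is always null for Reader-valued fields,
        // so query the inverted index to see whether the text was indexed.
        Hits hits = searcher.search(new TermQuery(new Term("contents", word)));
        boolean found = hits.length() > 0;
        searcher.close();
        return found;
    }
}
```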
RE: Academic Question About Indexing
I have a servlet that instantiates a MultiSearcher over 6 indexes (du -h):

7.2G    ./0
7.2G    ./1
7.2G    ./2
7.2G    ./3
7.2G    ./4
7.2G    ./5
43G     .

I recreate the index from scratch each month from a 50-gig zip file containing all 40 million documents. I wanted to keep my indexing time as low as possible without hurting search performance too much, as each searcher allocates an amount of memory proportional to the number of terms it has. A single large index has a lot of overlap in terms, so it needs less memory than multiple indexes.

Anyway, for indexing, I am able to index ~100 documents per second; the total indexing process takes 2.5 days. I have a powerful machine with 2 hyperthreaded processors (Linux sees 4 processors) and 1GB of RAM, plus pretty fast SCSI disks. I perform no updates or deletes on my indexes. The indexing process divides the work equally among the indexers. The bottleneck of the indexing process is not memory or CPU but the disk IO of 6 writers; if I had faster disks, I could create more indexers.

-Original Message-
From: Sodel Vazquez-Reyes [mailto:[EMAIL PROTECTED]
Sent: Thursday, November 11, 2004 11:37 AM
To: Lucene Users List
Cc: Will Allen
Subject: Re: Academic Question About Indexing

> Will, could you give more details about your architecture? Do you update or create new indexes each time? What data is stored in each index? Etc. It is quite interesting, and I would like to test it.
>
> Sodel
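The search side Will describes — one IndexSearcher per shard directory, combined under a MultiSearcher — might look like this sketch (the base directory layout ./0 .. ./5 follows his du output; everything else is an assumption):

```java
import java.io.IOException;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Searchable;

public class SixWaySearch {
    // baseDir is the directory containing the shard indexes ./0 .. ./5.
    public static MultiSearcher open(String baseDir) throws IOException {
        Searchable[] shards = new Searchable[6];
        for (int i = 0; i < shards.length; i++) {
            shards[i] = new IndexSearcher(baseDir + "/" + i);
        }
        // MultiSearcher merges hits from all shards into one ranked result set.
        return new MultiSearcher(shards);
    }
}
```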
RE: Bug in the BooleanQuery optimizer? ..TooManyClauses
Yes, I understand all of this, but I don't want to set it to MaxInt, since that can easily lead to (even accidental) DoS attacks. What I'm saying is that there is no reason for the optimizer to expand wild* to more than 1024 variations when I search for somerareword AND wild*: since somerareword is present in only, let's say, 100 documents, wild* should expand only to the words beginning with "wild" in those 100 documents, and then it would work fine with the default 1024-clause limit. But it doesn't, so I can choose between unusable queries and accidental DoS attacks.

--- Will Allen [EMAIL PROTECTED] wrote:

> Any wildcard search will automatically expand your query to the number of terms found in the index that match the wildcard. It is because of this that you quickly reach the 1024-clause limit. I automatically set it to max int with the following line: BooleanQuery.setMaxClauseCount( Integer.MAX_VALUE );
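A middle ground between the 1024 default and Integer.MAX_VALUE is to raise the limit to a bounded value and treat TooManyClauses as a user-facing "query too broad" error rather than a server failure. A sketch (the 20000 cap is an arbitrary assumption, not a recommendation from the thread):

```java
import java.io.IOException;

import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;

public class BoundedSearch {
    static {
        // Bounded, unlike Integer.MAX_VALUE, so a runaway wildcard
        // can't consume unlimited memory.
        BooleanQuery.setMaxClauseCount(20000);
    }

    public static Hits searchOrNull(Searcher searcher, Query query) throws IOException {
        try {
            return searcher.search(query);
        } catch (BooleanQuery.TooManyClauses e) {
            // Caller reports: "query matches too many terms, please narrow it".
            return null;
        }
    }
}
```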
Re: Query#rewrite Question
On Thursday 11 November 2004 03:51, Satoshi Hasegawa wrote:

> Hello, our program accepts input in the form of Lucene query syntax from the user, but we wish to perform additional tasks such as thesaurus expansion, so I want to manipulate the Query object that results from parsing. My question is: is the result of the Query#rewrite method guaranteed to be either a TermQuery, a PhraseQuery, or a BooleanQuery, and if it is a BooleanQuery, do all the constituent clauses also reduce to one of the above three classes? If not, what if the original Query object was obtained from the QueryParser#parse method? Can I assume the above in this restricted case? I experimented with the current version, and the above seems to hold; I'm asking whether this could change in the future. Thank you.

In general, a Query should either rewrite to another query or provide a Weight. During search, the Weight then provides a Scorer to score the docs. The only other type of query currently available is SpanQuery, which is a generalization of PhraseQuery. It does not rewrite and provides a Weight. However, the current QueryParser does not have support for SpanQuery. So, as long as QueryParser does not support more than the current types of queries, and you only use QueryParser to obtain queries, all the constituent clauses will reduce as you indicate above.

SpanQuery could be useful for thesaurus expansion. The generalization it provides is that it allows nested distance queries. For example, in:

  "word1 word2"~2

word2 can be expanded to:

  (word2 OR "word3 word4"~4)

leading to a query that is not supported by the current QueryParser:

  "word1 (word2 OR "word3 word4"~4)"~2

SpanQueries can also enforce an order on the matching subqueries, but that is difficult to express in the current query syntax.

Regards,
Paul Elschot
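Paul's nested distance query can be built directly with the span classes. A sketch using his example terms (the field name is a placeholder; both inner near queries are left unordered):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanOrQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class NestedSpans {
    // Builds: "word1 (word2 OR "word3 word4"~4)"~2
    public static SpanQuery build(String field) {
        SpanQuery word2 = new SpanTermQuery(new Term(field, "word2"));
        // "word3 word4"~4 : the two terms within 4 positions, any order
        SpanQuery inner = new SpanNearQuery(new SpanQuery[] {
                new SpanTermQuery(new Term(field, "word3")),
                new SpanTermQuery(new Term(field, "word4"))
        }, 4, false);
        // word2 expanded to (word2 OR "word3 word4"~4)
        SpanQuery expanded = new SpanOrQuery(new SpanQuery[] { word2, inner });
        // word1 within 2 positions of whichever alternative matched
        return new SpanNearQuery(new SpanQuery[] {
                new SpanTermQuery(new Term(field, "word1")), expanded
        }, 2, false);
    }
}
```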
Re: Bug in the BooleanQuery optimizer? ..TooManyClauses
On Thursday 11 November 2004 20:57, Sanyi wrote: What I'm saying is that there is no reason for the optimizer to expand wild* to more than 1024 variations. That's the point: there is no query optimizer in Lucene. Regards, Daniel -- http://www.danielnaber.de
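Because there is no optimizer, wild* is expanded into one clause per matching term before the AND is ever evaluated, so the exception fires regardless of the other operand. One user-side option is to catch it and degrade gracefully; a sketch (the SafeSearch wrapper and its names are hypothetical, but BooleanQuery.TooManyClauses is the real exception thrown in Lucene 1.4):

```java
import java.io.IOException;

import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;

public class SafeSearch {
    // Hypothetical helper: maps the clause-limit blowup to a null result
    // so the UI can show "no results / narrow your search" instead of an error.
    static Hits searchOrNull(Searcher searcher, Query query) throws IOException {
        try {
            return searcher.search(query);
        } catch (BooleanQuery.TooManyClauses e) {
            // the wildcard expanded past the limit (default 1024)
            // before the AND with the other term was evaluated
            return null;
        }
    }
}
```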
weird things in 1.4.2 build
Hi guys, Thanks for the fantastic mailing list, where all the questions get answered. I have upgraded my installation from 1.3-final to 1.4.2, and now when I try to index the files using IndexHTML, the command just hangs at the prompt, or parses some 4-5 files and then simply hangs. Any idea what the issue could be? TIA, -H
getting error message
Does anyone know what the following error message means? TIA. -H

root cause java.lang.NullPointerException
at org.apache.jsp.searchResults_jsp._jspService(searchResults_jsp.java:627)
at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:137)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:853)
at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:210)
at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:295)
at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:241)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:853)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:247)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:193)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:256)
at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
at org.apache.catalina.core.StandardContext.invoke(StandardContext.java:2417)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:180)
at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
at org.apache.catalina.valves.ErrorDispatcherValve.invoke(ErrorDispatcherValve.java:171)
at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:641)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:172)
at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:641)
at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:174)
at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
at org.apache.coyote.tomcat4.CoyoteAdapter.service(CoyoteAdapter.java:193)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:781)
at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:549)
at org.apache.tomcat.util.net.TcpWorkerThread.runIt(PoolTcpEndpoint.java:589)
at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:666)
at java.lang.Thread.run(Thread.java:534)
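The trace only shows that line 627 of the generated searchResults_jsp dereferenced null; a common cause in search JSPs is reading a request attribute (e.g. the Hits object) that was never set because the search failed or was skipped. A sketch of the guard pattern (all names are hypothetical; a plain Map stands in for the request attributes so the idea is runnable without a servlet container):

```java
import java.util.HashMap;
import java.util.Map;

public class NullGuardExample {
    // Stand-in for request.getAttribute(): returns null when nothing was set.
    static Object getAttribute(Map attributes, String name) {
        return attributes.get(name);
    }

    public static void main(String[] args) {
        Map request = new HashMap(); // "hits" was never set, as after a failed search
        Object hits = getAttribute(request, "hits");
        if (hits == null) {
            // render a "no results" page instead of letting the JSP throw an NPE
            System.out.println("no results");
        } else {
            System.out.println(hits);
        }
    }
}
```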
Lucene : avoiding locking
Hi All; I have hit a snag in my Lucene integration and don't know what to do. My company has a content management product. Each time someone changes the directory structure or a file within it, that portion of the site needs to be re-indexed so the changes are reflected in future searches (indexing must happen at run time). I have written an Indexer class with a static Index() method. The idea is to call the method every time something changes and the index needs to be re-examined. I am hoping the logic put in by Doug Cutting surrounding the UID will make indexing efficient enough to be called this frequently. This class works great when I tested it on my own little site (I have about 2000 files). But when I drop the functionality into the QA environment I get a locking error. I can't access the stack trace; all I can get at is a log file the application writes to. Here is the section my class wrote. It was right in the middle of indexing and bang, lock issue. I don't know if the problem is in my code or something in the existing application.

Error Message:
ENTER|SearchEventProcessor.visit(ContentNodeDeleteEvent)
|INFO|INDEXING INFO: Start Indexing new content.
|INFO|INDEXING INFO: Index Folder Did Not Exist. Start Creation Of New Index
|INFO|INDEXING INFO: Beginnging Incremental update comparisions
[the line above repeats many times in the log]
|INFO|INDEXING ERROR: Unable to index new content Lock obtain timed out: Lock@/usr/tomcat/jakarta-tomcat-5.0.19/temp/lucene-398fbd170a5457d05e2f4d432 10f7fe8-write.lock
|ENTER|UpdateCacheEventProcessor.visit(ContentNodeDeleteEvent)

Here is my code. You will recognize it pretty much as the IndexHTML class from the Lucene demo written by Doug Cutting. I have put a ton of comments in an attempt to understand what is going on. Any help would be appreciated. Luke

package com.fbhm.bolt.search;

/*
 * Created on Nov 11, 2004
 *
 * This class will create a single index file for the Content
 * Management System (CMS). It contains logic to ensure
 * indexing is done intelligently. Based on IndexHTML.java
 * from the demo folder that ships with Lucene.
 */
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.Date;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.pdfbox.searchengine.lucene.LucenePDFDocument;
import org.apache.lucene.demo.HTMLDocument;
import com.alaia.common.debug.Trace;
import com.alaia.common.util.AppProperties;

/**
 * @author lshannon
 * Description: This class is used to index a content folder. It contains
 * logic to ensure only new documents, or documents modified since the last
 * search, are indexed. Based on code written by Doug Cutting in the
 * IndexHTML class found in the Lucene demo.
 */
public class Indexer {
  // true during the deletion pass, i.e. when the index already exists
  private static boolean deleting = false;
  // object to read existing indexes
  private static IndexReader reader;
  // object to write to the index folder
  private static IndexWriter writer;
  // this will be used to walk the index by UID term
  private static TermEnum uidIter;

  /*
   * This static method does all the work; the end result is an
   * up-to-date index folder.
   */
  public static void Index() {
    // we will assume to start that the index has been created
    boolean create = true;
    // set the name of the index file
    String indexFileLocation = AppProperties.getPropertyAsString("bolt.search.siteIndex.index.root");
    // set the name of the content folder
    String contentFolderLocation = AppProperties.getPropertyAsString("site.root");
    // manage whether the index needs to be created or not
    File index = new File(indexFileLocation);
Re: Lucene : avoiding locking
I'm working on a similar project... Make sure that only one call to the index method is occurring at a time. Synchronizing that method should do it.

--- Luke Shannon [EMAIL PROTECTED] wrote: Hi All; I have hit a snag in my Lucene integration and don't know what to do.
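As suggested, making the static Index() method synchronized serializes all callers on the Indexer class lock. A runnable sketch of the pattern (the Lucene work is replaced by a counter so the mutual exclusion itself can be demonstrated; this class is a stand-in for the real Indexer, not the poster's code):

```java
public class SynchronizedIndexer {
    private static int active = 0;     // threads currently inside Index()
    private static int maxActive = 0;  // highest concurrency ever observed

    // synchronized on SynchronizedIndexer.class: only one thread may index at a time
    public static synchronized void Index() {
        active++;
        maxActive = Math.max(maxActive, active);
        // ... the real method would open the IndexWriter, walk the UID terms, etc. ...
        active--;
    }

    public static int getMaxActive() { return maxActive; }

    public static void main(String[] args) throws InterruptedException {
        Runnable task = new Runnable() {
            public void run() {
                for (int i = 0; i < 1000; i++) Index();
            }
        };
        Thread a = new Thread(task), b = new Thread(task);
        a.start(); b.start();
        a.join(); b.join();
        // with synchronization, at most one thread is ever inside Index()
        System.out.println(maxActive); // prints 1
    }
}
```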
Re: Lucene : avoiding locking
I will try that now. Thank you.

- Original Message -
From: [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Thursday, November 11, 2004 6:56 PM
Subject: Re: Lucene : avoiding locking

I'm working on a similar project... Make sure that only one call to the index method is occurring at a time. Synchronizing that method should do it.
Re: Lucene : avoiding locking
Synchronizing the method didn't seem to help. The lock is being detected right here in the code:

while (uidIter.term() != null
       && uidIter.term().field() == "uid"
       && uidIter.term().text().compareTo(uid) < 0) {
  // delete stale docs
  if (deleting) {
    reader.delete(uidIter.term());
  }
  uidIter.next();
}

This runs fine on my own site, so I am confused. For now I think I am going to remove the deleting of stale files and just rebuild the index each time to see what happens.

- Original Message -
From: [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Thursday, November 11, 2004 6:56 PM
Subject: Re: Lucene : avoiding locking

I'm working on a similar project... Make sure that only one call to the index method is occurring at a time. Synchronizing that method should do it.
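When a previous run died while holding the write lock, the "Lock obtain timed out" above is expected on the next run. In Lucene 1.4 the lock can be detected and cleared before opening the writer; a hedged sketch (the index path is a placeholder, and clearing is only safe when no other process can be writing):

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class StaleLockCheck {
    public static void main(String[] args) throws java.io.IOException {
        // "/path/to/index" is hypothetical; use the real index folder
        Directory dir = FSDirectory.getDirectory("/path/to/index", false);
        if (IndexReader.isLocked(dir)) {
            // Only do this when certain no other process is indexing:
            // the lock may simply be left over from a crashed run.
            IndexReader.unlock(dir);
        }
    }
}
```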
Re: Query#rewrite Question
Thank you, Erik and Paul. I'm not sure what SpanQuery is, but anyway we've decided to freeze the version of Lucene we use.
lucene file locking question
Hi folks: My application builds a super-index around the Lucene index, i.e. it stores some additional information outside of Lucene. I am using my own locking outside of the Lucene index via the FileLock object in the JDK 1.4 nio package. My code does the following:

FileLock lock = null;
try {
  lock = myLockFileChannel.lock();
  // indexing into lucene
  // indexing additional information
} finally {
  try {
    // commit lucene index by closing the IndexWriter instance
  } finally {
    if (lock != null) {
      lock.release();
    }
  }
}

Now here is the weird thing: say I terminate the process in the middle of indexing, and run the program again. I get a Lock obtain timed out exception, but as long as I delete the stale lock file, the index remains uncorrupted. However, if I turn the Lucene file lock off, since I have a lock outside it anyway (by doing:

static {
  System.setProperty("disableLuceneLocks", "true");
}

) and do the same thing, I instead get an unrecoverably corrupted index. Does the Lucene lock really guarantee index integrity under this kind of abuse, or am I just getting lucky? If so, can someone shed some light on how? Thanks in advance -John
Re: lucene file locking question
Disabling locking is only recommended for read-only indexes that aren't being modified. I think there is a comment in the code about a good example of this being an index you read off of a CD-ROM.

--- John Wang [EMAIL PROTECTED] wrote: Hi folks: My application builds a super-index around the Lucene index.
Re: Bug in the BooleanQuery optimizer? ..TooManyClauses
That's the point: there is no query optimizer in Lucene. Sorry, I'm not very much into Lucene's internal classes; I'm just telling you the viewpoint of a user. You know, my users aren't technicians, so answers like yours won't make them happy. They will only see that I randomly don't allow them to search (with the 1024 limit). They won't understand why I am displaying "Please restrict your search a bit more" when they've just searched for dodge AND vip* and there are only a few documents matching this criteria. So, is the only way to make them able to search happily by setting the max clause limit to MaxInt?
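If raising the limit is acceptable for your data set, Lucene 1.4 exposes it as a static setting on BooleanQuery; a sketch (note the memory cost grows with the number of expanded terms, so MaxInt is a trade-off rather than a fix):

```java
import org.apache.lucene.search.BooleanQuery;

public class ClauseLimitExample {
    public static void main(String[] args) {
        // default is 1024; every term matching a wildcard becomes one clause,
        // so this trades the TooManyClauses exception for memory use
        BooleanQuery.setMaxClauseCount(Integer.MAX_VALUE);
        System.out.println(BooleanQuery.getMaxClauseCount());
    }
}
```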
Phrase search for more than 4 words throws exception in QueryParser
Hi! How do I perform phrase searches for more than four words? This works well with 1.4.2: aa bb cc dd. I pass the query as a command line parameter on XP: \"aa bb cc dd\" QueryParser translates it to: text:aa text:bb text:cc text:dd. Runs, searches, finds proper matches. This throws an exception in QueryParser: aa bb cc dd ee. I pass the query as a command line parameter on XP: \"aa bb cc dd ee\" The exception's text is: org.apache.lucene.queryParser.ParseException: Lexical error at line 1, column 13. Encountered: EOF after: \"aa bb cc dd. It doesn't matter what words I enter; the only thing that matters is the number of words, which can be four at most. Regards, Sanyi
Re: Phrase search for more than 4 words throws exception in QueryParser
Sanyi writes: How to perform phrase searches for more than four words?

Works for me on Linux:

java -cp lucene.jar org.apache.lucene.queryParser.QueryParser 'a b c d e f g h i j k l m n o p q r s t u v w x y z'
a b c d e f g h i j k l m n o p q r s t u v w x y z

Must be an XP command line problem. HTH Morus
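A tiny argv dumper makes it easy to check whether the closing quote survives the shell before blaming QueryParser; run it with exactly the command line you pass to your search program (the class is mine, not from the thread):

```java
// Prints each command-line argument in brackets so shell-quoting problems
// become visible. On XP's cmd.exe, try:
//   java ArgsDemo "\"aa bb cc dd ee\""
// If the output is not ["aa bb cc dd ee"] on a single line, the shell,
// not QueryParser, is splitting or eating the phrase.
public class ArgsDemo {
    static String bracket(String s) {
        return "[" + s + "]";
    }

    public static void main(String[] args) {
        for (int i = 0; i < args.length; i++) {
            System.out.println(bracket(args[i]));
        }
    }
}
```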