RE: Custom filters document numbers
An IndexReader will always see the same set of documents. Even if another process deletes some documents, adds new ones or optimizes the complete index, your IndexReader instance will not see those changes.

If you detect that the Lucene index has changed (e.g. by calling IndexReader.getCurrentVersion(...) once in a while), you should close and reopen your 'current' IndexReader and recalculate any data that relies on the Lucene document numbers.

Regards,
Luc.

-----Original Message-----
From: Stanislav Jordanov [mailto:[EMAIL PROTECTED]
Sent: Thursday, 24 February 2005 14:18
To: Lucene Users List
Subject: Custom filters document numbers

Given an IndexReader, a custom filter is supposed to create a bit set that maps each document number to {'visible', 'invisible'}. On the other hand, it is stated that Lucene is allowed to change document numbers. Is it guaranteed that this BitSet's view of document numbers won't change while the BitSet is still in use (or perhaps while the corresponding IndexReader is still open)?

And another (more low-level) question: when may Lucene change document numbers? Is it only when the index is optimized after there has been a delete operation?

Regards,
StJ

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
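The advice in Luc's reply (check the index version now and then, and rebuild anything keyed by document number when it changes) could be sketched roughly as below. This is plain Java and does not call Lucene: VersionedBits, currentVersion and rebuild are hypothetical names for illustration. In a real setup the version supplier would presumably wrap IndexReader.getCurrentVersion(...), and the rebuild step would close and reopen the reader before recomputing the bits.

```java
import java.util.BitSet;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical helper, not part of Lucene: caches a per-document BitSet and
// rebuilds it whenever the observed index version changes.
class VersionedBits {
    private final LongSupplier currentVersion; // assumed to report the index version
    private final Supplier<BitSet> rebuild;    // assumed to reopen the reader and recompute
    private long seenVersion;
    private BitSet bits;

    VersionedBits(LongSupplier currentVersion, Supplier<BitSet> rebuild) {
        this.currentVersion = currentVersion;
        this.rebuild = rebuild;
    }

    // Returns bits computed for the index version observed at call time.
    synchronized BitSet get() {
        long version = currentVersion.getAsLong();
        if (bits == null || version != seenVersion) {
            bits = rebuild.get();   // document numbers may have changed, so start over
            seenVersion = version;
        }
        return bits;
    }
}
```

As long as the version stays the same, get() hands back the same BitSet instance; after a version change the next call recomputes it once.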
RE: QUERYPARSER + LEXECIAL ERROR
Hi,

There's nothing wrong with the searchWrd string itself (I ran a few variations of it through my test setup). The '\' characters in the source are escapes for the Java compiler and will never be seen by Lucene.

The only way I found to produce a similar exception (org.apache.lucene.queryParser.ParseException: Lexical error at line 1, column 15. Encountered: EOF after : ) is by passing a string without a matching closing quote, like:

    String searchWrd = "kid \"toy\" OR \"";

The exception is thrown as soon as you pass the string to QueryParser.parse().

I tested using Lucene 1.4.3 and JDK 1.5.0.

Luc

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Monday, 17 January 2005 16:23
To: Lucene Users List
Subject: Re: QUERYPARSER + LEXECIAL ERROR

Hello,

Try:

    String searchWrd = "kid \"toy\" OR kid \"ball\"";

You'll have to use a WhitespaceAnalyzer with that, though, or a custom Analyzer that doesn't remove the escape character (\).

Otis

--- Karthik N S [EMAIL PROTECTED] wrote:

Hi Guys. Apologies.

The Query Parser is giving me a lexical error:

    String searchWrd = "kid \"toy\" OR \"ball\"";

    org.apache.lucene.queryParser.TokenMgrError: Lexical error at line 1, column 26. Encountered: EOF after :
        at org.apache.lucene.queryParser.QueryParserTokenManager.getNextToken(QueryParserTokenManager.java:1050)

Why is this happening?

Lucene version: Lucene 1.4.3.jar
JDK version: JDK 1.4.2
O/S: Win2000

Somebody please reply.

WITH WARM REGARDS
HAVE A NICE DAY
[ N.S.KARTHIK]
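Since the only reproduction Luc found was an unmatched double quote, a cheap check before handing the string to QueryParser.parse() might look like the sketch below. QuoteCheck and hasBalancedQuotes are made-up names, and the check assumes every double quote in the query text opens or closes a phrase, which may not hold for every parser version.

```java
// Hypothetical pre-check, not part of Lucene.
class QuoteCheck {
    // Returns true when the query contains an even number of double quotes,
    // i.e. every phrase that is opened is also closed.
    static boolean hasBalancedQuotes(String query) {
        int quotes = 0;
        for (int i = 0; i < query.length(); i++) {
            if (query.charAt(i) == '"') {
                quotes++;
            }
        }
        return quotes % 2 == 0;
    }
}
```

With the strings from this thread, "kid \"toy\" OR \"ball\"" passes the check and "kid \"toy\" OR \"" fails it, so the second one can be rejected with a friendlier message than a lexical error.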
RE: HTMLParser.getReader returning null
If you use the Field.Text(String name, Reader value) version of the Field.Text factory method, the field is tokenized and indexed but *not* stored. This means you will be able to search and find that document, but to know the original contents you will have to store a copy of it elsewhere.

The Field.Text(String name, String value) version does store the document String itself, so that's probably the origin of the confusion.

-----Original Message-----
From: Luke Shannon [mailto:[EMAIL PROTECTED]
Sent: Thursday, 11 November 2004 20:17
To: Lucene Users List
Subject: HTMLParser.getReader returning null

Hello;

Things were working fine. I have been re-organizing my code to drop into QA when I noticed I was no longer getting search results for my HTML files. When I checked things out I confirmed I was still creating the Documents but realized no content was being indexed.

    HTMLParser parser = new HTMLParser(f);
    // Add the tag-stripped contents as a Reader-valued Text field so it will
    // get tokenized and indexed.
    doc.add(Field.Text("contents", parser.getReader()));
    System.out.println("The content is " + doc.get("contents"));

The SOP line above outputs a null where the contents used to be. Anyone seen this before?

Thanks,
Luke

----- Original Message -----
From: Will Allen [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Thursday, November 11, 2004 1:59 PM
Subject: RE: Bug in the BooleanQuery optimizer? ..TooManyClauses

Any wildcard search will automatically expand your query to the number of terms it finds in the index that match the wildcard. For example, wild* would become wild OR wilderness OR wildman etc., one clause for each of the terms that exist in your index. Because of this, you quickly reach the 1024 limit of clauses.
I automatically set it to max int with the following line:

    BooleanQuery.setMaxClauseCount( Integer.MAX_VALUE );

-----Original Message-----
From: Sanyi [mailto:[EMAIL PROTECTED]
Sent: Thursday, November 11, 2004 6:46 AM
To: [EMAIL PROTECTED]
Subject: Bug in the BooleanQuery optimizer? ..TooManyClauses

Hi!

First of all, I've read about BooleanQuery$TooManyClauses, so I know that it has a 1024-clause limit by default, which is good enough for me, but I still think it behaves strangely.

Example: I have an index with about 20 million documents. Let's say that there are about 3000 variants in the entire document set of this word mask: cab*. Let's say that about 500 documents contain the word: spectrum.

Now, when I search for "cab* AND spectrum", I don't expect it to throw an exception. It should first restrict the search to the 500 documents containing the word spectrum, then it should collect the variants of cab* within these documents, which turns out to be two or three variants of cab* (cable, cables, maybe some more), and the search should return, let's say, 10 documents.

Similar example: when I search for "cab* AND nonexistingword" it still throws a TooManyClauses exception instead of saying "No results". Since there is no nonexistingword in my document set, it doesn't even have to start collecting the variations of cab*.

Is there any patch for this issue?

Thank you for your time!
Sanyi
(I'm using: lucene 1.4.2)

p.s.: Sorry for re-sending this message, I first sent it as an accidental reply to a wrong thread.
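To illustrate the expansion Will describes, here is a small stand-alone sketch that is not Lucene code: it treats the index's term dictionary as a sorted set and collects every term starting with a prefix, one would-be clause per term. PrefixExpansion, expand and exceedsLimit are hypothetical names; 1024 is the default limit mentioned in the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;

// Hypothetical model of wildcard expansion; not the Lucene implementation.
class PrefixExpansion {
    static final int MAX_CLAUSES = 1024; // default clause limit quoted in the thread

    // Collects all terms that start with the prefix. In a sorted set these
    // terms form one contiguous run beginning at the prefix itself.
    static List<String> expand(SortedSet<String> terms, String prefix) {
        List<String> matches = new ArrayList<>();
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // past the run of matching terms
            }
            matches.add(term);
        }
        return matches;
    }

    // True when the expansion alone would already exceed the clause limit.
    static boolean exceedsLimit(SortedSet<String> terms, String prefix) {
        return expand(terms, prefix).size() > MAX_CLAUSES;
    }
}
```

In this model the expansion only looks at the term dictionary, never at the other clauses of the query, which would be consistent with what Sanyi observes: cab* is expanded to all of its variants before "AND spectrum" gets a chance to narrow anything down.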
RE: BooleanQuery - TooManyClauses
Even if you need to be able to search on ranges that include the time, you could benefit from adding a few extra fields to your documents. For example, add a date field and an hour field. If the user then specifies a range between 2001-08-10 11:00 and 2004-10-11 13:00, you break it up behind the scenes into three parts as follows:

- a query on the date field alone, testing on the range 2001-08-11 to 2004-10-10 (i.e. all dates fully within the date range); max number of clauses = max number of dates in your documents
- a query on the hour field for the first date; max number of clauses = 24
- a query on the hour field for the last date; max number of clauses = 24

(You'll need a special case if the start and end happen to be on the same date, of course.)

I'm not that familiar with the QueryParser syntax yet, but it should look something like this (note the use of curly brackets for the exclusive date ranges):

    (date:{20010810 TO 20041011}) OR (+date:20010810 +time:[11 TO ]) OR (+date:20041011 +time:{ TO 13})

If you need even more fine-grained ranges, you can extend this idea by adding more fields (at the cost of making the generated query even more complex). You can already add the separate fields to your documents even if you don't use them yet...

Regards,
Luc

-----Original Message-----
From: Terry Steichen [mailto:[EMAIL PROTECTED]
Sent: Tuesday, 26 October 2004 18:28
To: Lucene Users List
Subject: Re: BooleanQuery - TooManyClauses

I think what Erik's asking is whether you can live with expressing your indexed date in the form of YYYYMMDD, without the hour and minute extension. That will sharply reduce the number of range query expansion terms.
If you're using the timestamp as a unique identifier, you might consider creating two fields, one for the unique identifier (YYYYMMDDHHmmssZ) and one for the date (YYYYMMDD), and only use the range on the date field (not on the timestamp field).

Regards,
Terry

----- Original Message -----
From: Angelov, Rossen
To: 'Lucene Users List'
Sent: Tuesday, October 26, 2004 11:43 AM
Subject: RE: BooleanQuery - TooManyClauses

On Oct 25, 2004, at 6:35 PM, Angelov, Rossen wrote:

Why is there a limit on the number of clauses? And is there any harm in setting MaxClauseCount to Integer.MAX_VALUE?

The harm is in performance and resource utilization. Rather than do this, though, read on...

I'm using a Range Query on a field that represents dates and getting a BooleanQuery$TooManyClauses exception. This is the query: +/article/createddateiso8601:[2003010100 TO 2003123199]

Do you really need to do ranges down to that time level? Or are you really just concerned with date? If you indexed using YYYYMMDD instead, there would only be a maximum of 365 terms in that range, whereas you've got zillions (ok, I was too lazy to do the math! But far more than 1,024).

I need to do range searches. They are part of the requirements and, even worse, the range can be as big as 10 years for now. It will get bigger. I'm indexing using the YYYYMMDDHHmmssZ format and, as you said, there will be more than just 365 terms per year. This number changes every day as new documents are indexed daily. The only limit I can see is the number of documents that were indexed. I guess maxClauseCount can't be more than the number of indexed documents.

I recommend changing how you index dates, or at least using a different field for queries that do not need to concern themselves with the timestamp aspect.

What do you mean by changing how the dates are indexed? By the way, this field is indexed as a string.

Erik

Ross
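The three-part split Luc describes at the top of this thread could be generated along the following lines. This is only a sketch under assumptions: the field names date and time are taken from his example query, hours are assumed to be indexed as two-digit strings (00 to 23), the end hour is treated as exclusive, and the open-ended hour ranges from his example are written out with explicit bounds. DateRangeSplitter, split and hours are made-up names, and whether a given QueryParser version accepts the resulting text has not been verified here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper that only builds query text; it does not call Lucene.
class DateRangeSplitter {
    // Builds query text for the range from startDay startHour:00 (inclusive)
    // up to endDay endHourExclusive:00 (exclusive). Days are yyyyMMdd strings.
    static String split(String startDay, int startHour, String endDay, int endHourExclusive) {
        if (startDay.equals(endDay)) {
            // Special case: both ends on the same date, so one hour range is enough.
            return "+date:" + startDay + " +time:" + hours(startHour, endHourExclusive - 1);
        }
        List<String> parts = new ArrayList<>();
        // All dates strictly between the first and the last date.
        parts.add("(date:{" + startDay + " TO " + endDay + "})");
        // Remaining hours of the first date.
        parts.add("(+date:" + startDay + " +time:" + hours(startHour, 23) + ")");
        // Hours of the last date before the end hour, if there are any.
        if (endHourExclusive > 0) {
            parts.add("(+date:" + endDay + " +time:" + hours(0, endHourExclusive - 1) + ")");
        }
        return String.join(" OR ", parts);
    }

    // Inclusive hour range with zero-padded bounds so string order matches numeric order.
    static String hours(int from, int to) {
        return String.format(Locale.ROOT, "[%02d TO %02d]", from, to);
    }
}
```

For the range from the thread, split("20010810", 11, "20041011", 13) yields a date clause plus two hour clauses, so the clause count is bounded by the number of distinct dates plus at most 48 hour terms.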