RE: Custom filters & document numbers

2005-02-24 Thread Vanlerberghe, Luc
An IndexReader will always see the same set of documents.
Even if another process deletes some documents, adds new ones or
optimizes the complete index, your IndexReader instance will not see
those changes.

If you detect that the Lucene index changed (e.g. by calling
IndexReader.getCurrentVersion(...) once in a while), you should close
and reopen your 'current' IndexReader and recalculate any data that
relies on the Lucene document numbers.
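
A minimal sketch of that "detect the change, then recalculate" pattern, in
plain Java without any Lucene dependency. The class name
VersionedBitSetCache and both suppliers are hypothetical: the LongSupplier
stands in for a call such as IndexReader.getCurrentVersion(...), and the
Supplier stands in for "reopen the IndexReader and rebuild the BitSet".

```java
import java.util.BitSet;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Caches a per-document BitSet and rebuilds it when the index version changes. */
class VersionedBitSetCache {
    // Hypothetical stand-in for polling the index version.
    private final LongSupplier currentVersion;
    // Hypothetical stand-in for "reopen the reader and recompute the bits".
    private final Supplier<BitSet> rebuild;

    private long cachedVersion;
    private BitSet cachedBits; // null until the first call to get()

    VersionedBitSetCache(LongSupplier currentVersion, Supplier<BitSet> rebuild) {
        this.currentVersion = currentVersion;
        this.rebuild = rebuild;
    }

    /** Returns the cached bits, recomputing them if the version has moved on. */
    synchronized BitSet get() {
        long version = currentVersion.getAsLong();
        if (cachedBits == null || version != cachedVersion) {
            cachedBits = rebuild.get();
            cachedVersion = version;
        }
        return cachedBits;
    }
}
```

As long as the version stays the same, callers keep getting the same BitSet
instance, which matches the point that document numbers are stable for one
open IndexReader.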

Regards, Luc.

-Original Message-
From: Stanislav Jordanov [mailto:[EMAIL PROTECTED] 
Sent: donderdag 24 februari 2005 14:18
To: Lucene Users List
Subject: Custom filters & document numbers

Given an IndexReader, a custom filter is supposed to create a bit set
that maps each document number to {'visible', 'invisible'}. On the other
hand, it is stated that Lucene is allowed to change document numbers.
Is it guaranteed that this BitSet's view of document numbers won't
change while the BitSet is still in use (or perhaps while the
corresponding IndexReader is still open)?

And another (more low-level) question:
When may Lucene change document numbers?
Is it only when the index is optimized after there has been a delete
operation?

Regards: StJ


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]






RE: QUERYPARSER + LEXECIAL ERROR

2005-01-17 Thread Vanlerberghe, Luc
Hi,

There's nothing wrong with the searchWrd string itself (I ran a few
variations of it through my test setup).
The '\' characters in the source are escapes for the Java compiler and
will never be seen by Lucene.

The only way I found to produce a similar Exception
(org.apache.lucene.queryParser.ParseException: Lexical error at line 1,
column 15.  Encountered: <EOF> after : "\"")
is by passing a string without a matching closing quote like:
String searchWrd = "kid \"toy\" OR \"";

The exception is thrown as soon as you pass the string to
QueryParser.parse().
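
To illustrate the escaping point, here is a small plain-Java sketch (no
Lucene involved). QuoteCheck and hasBalancedQuotes are hypothetical names
made up for this example; the check is deliberately naive and only counts
double quotes, so it ignores any quote escaping inside the query syntax
itself.

```java
class QuoteCheck {
    /** Returns true if the text contains an even number of double quotes. */
    static boolean hasBalancedQuotes(String query) {
        int quotes = 0;
        for (int i = 0; i < query.length(); i++) {
            if (query.charAt(i) == '"') {
                quotes++;
            }
        }
        return quotes % 2 == 0;
    }

    public static void main(String[] args) {
        String ok = "kid \"toy\" OR \"ball\"";
        String broken = "kid \"toy\" OR \"";

        System.out.println(ok);                        // prints: kid "toy" OR "ball"
        System.out.println(ok.indexOf('\\'));          // prints: -1 (no backslash at runtime)
        System.out.println(hasBalancedQuotes(ok));     // prints: true
        System.out.println(hasBalancedQuotes(broken)); // prints: false
    }
}
```

A check like this could be run on user input before handing it to the
parser, so that an unterminated phrase is reported in a friendlier way.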

I tested using lucene 1.4.3 and jdk 1.5.0

Luc

 -Original Message-
 From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
 Sent: maandag 17 januari 2005 16:23
 To: Lucene Users List
 Subject: Re: QUERYPARSER + LEXECIAL ERROR
 
 Hello,
 
 Try:
   String searchWrd = "kid \"toy\" OR kid \"ball\""
 
 You'll have to use a WhitespaceAnalyzer with that, though, or 
 a custom Analyzer that doesn't remove the escape character (\).
 
 Otis
 
 
 --- Karthik N S [EMAIL PROTECTED] wrote:
 
  
  
  Hi  Guys.
  
  Apologies.
  
  
  
  The Query Parser is giving me a lexical error:
  
  String searchWrd = "kid \"toy\" OR \"ball\""
  
  org.apache.lucene.queryParser.TokenMgrError: Lexical error at line 1,
  column 26. Encountered: <EOF> after : 
  at org.apache.lucene.queryParser.QueryParserTokenManager.getNextToken(QueryParserTokenManager.java:1050)
  
  Why is this happening?
  
  Lucene version :  Lucene 1.4.3.jar
  Jdk version:  Jdk 1.4.2
  O/s  :  Win2000
  
  Somebody please reply.
  
  
  
  
  
  
  WITH WARM REGARDS
  HAVE A NICE DAY
  [ N.S.KARTHIK]
  
  
  
  



RE: HTMLParser.getReader returning null

2004-11-12 Thread Vanlerberghe, Luc
If you use the Field.Text(String name, Reader value) version of the
Field.Text factory method, the field is tokenized and indexed but *not*
stored.  This means you will be able to search and find that document,
but to know the original contents you will have to store a copy of it
elsewhere.

The Field.Text(String name, String value) version does store the
document String itself, so that's probably the origin of the confusion.

 -Original Message-
 From: Luke Shannon [mailto:[EMAIL PROTECTED] 
 Sent: donderdag 11 november 2004 20:17
 To: Lucene Users List
 Subject: HTMLParser.getReader returning null
 
 Hello;
 
 Things were working fine. I have been re-organizing my code 
 to drop into QA when I noticed I was no longer getting search 
 results for my HTML files.
 When I checked things out I confirmed I was still creating 
 the Documents but realized no content was being indexed.
 
 HTMLParser parser = new HTMLParser(f);
 
 // Add the tag-stripped contents as a Reader-valued Text field so it will
 // get tokenized and indexed.
 doc.add(Field.Text("contents", parser.getReader()));
 System.out.println("The content is " + doc.get("contents"));
 
 The println line above outputs null where the contents used to 
 be. Has anyone seen this before?
 
 Thanks,
 
 Luke
 
 - Original Message -
 From: Will Allen [EMAIL PROTECTED]
 To: Lucene Users List [EMAIL PROTECTED]
 Sent: Thursday, November 11, 2004 1:59 PM
 Subject: RE: Bug in the BooleanQuery optimizer? ..TooManyClauses
 
 
 Any wildcard search will automatically expand your query to the number of
 terms it finds in the index that match the wildcard.
 
 For example:
 
 wild* would become "wild OR wilderness OR wildman" etc., with one clause
 for each of the matching terms that exist in your index.
 
 Because of this, you quickly reach the limit of 1024 clauses.  I
 automatically set it to max int with the following line:
 
 BooleanQuery.setMaxClauseCount( Integer.MAX_VALUE );
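
The expansion described above can be sketched in plain Java, without
Lucene. This is an illustration only, not Lucene's implementation:
PrefixExpansion, expand and the IllegalStateException are hypothetical
stand-ins, and the sorted set of strings plays the role of the index's term
dictionary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;

class PrefixExpansion {
    /**
     * Collects every term that starts with the prefix, one clause per term,
     * and fails once the number of clauses would exceed maxClauses.
     */
    static List<String> expand(SortedSet<String> terms, String prefix, int maxClauses) {
        List<String> clauses = new ArrayList<>();
        // In a sorted set, all terms sharing a prefix are contiguous and
        // start at the first term >= prefix.
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // walked past the last matching term
            }
            if (clauses.size() == maxClauses) {
                throw new IllegalStateException("more than " + maxClauses + " clauses");
            }
            clauses.add(term);
        }
        return clauses;
    }
}
```

Note that the expansion only looks at the term dictionary, not at which
documents the other query clauses match, which is why a prefix with
thousands of variants can hit the limit regardless of the rest of the query.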
 
 
 -Original Message-
 From: Sanyi [mailto:[EMAIL PROTECTED]
 Sent: Thursday, November 11, 2004 6:46 AM
 To: [EMAIL PROTECTED]
 Subject: Bug in the BooleanQuery optimizer? ..TooManyClauses
 
 
 Hi!
 
 First of all, I've read about BooleanQuery$TooManyClauses, so I know that it
 has a limit of 1024 clauses by default, which is good enough for me, but I
 still think it behaves strangely.
 
 Example:
 I have an index with about 20 million documents.
 Let's say that there are about 3000 variants of the word mask cab* in the
 entire document set.
 Let's say that about 500 documents contain the word spectrum.
 Now, when I search for "cab* AND spectrum", I don't expect it to throw an
 exception.
 It should first restrict the search to the 500 documents containing the
 word spectrum, then it should collect the variants of cab* within these
 documents, which turns out to be two or three variants of cab* (cable,
 cables, maybe some more), and the search should return, let's say, 10
 documents.
 
 Similar example: when I search for "cab* AND nonexistingword", it still
 throws a TooManyClauses exception instead of saying "No results". Since
 there is no nonexistingword in my document set, it doesn't even have to
 start collecting the variations of cab*.
 
 Is there any patch for this issue?
 Thank you for your time!
 
 Sanyi
 (I'm using: lucene 1.4.2)
 
 p.s.: Sorry for re-sending this message, I was first sending it as an
 accidental reply to a wrong thread..
 
 
 



RE: BooleanQuery - TooManyClauses

2004-10-26 Thread Vanlerberghe, Luc
Even if you need to be able to search on ranges that include the time,
you could benefit from adding a few extra fields to your documents.

For example: add a date field and an hour field:

If the user then specifies a range between 2001-08-10 11:00 and
2004-10-11 13:00, you break it up behind the scenes into three parts as
follows:
- a query on the date field alone, testing on the range 2001-08-11 to
2004-10-10 (i.e. all dates fully within the date range) -> max number
of clauses = number of distinct dates in your documents
- a query on the hour field for the first date -> max number of
clauses = 24
- a query on the hour field for the last date -> max number of
clauses = 24
(You'll need a special case if the start and end happen to be on the
same date of course)

I'm not that familiar with the QueryParser syntax yet, but it should
look something like this (note the use of curly brackets for the
exclusive date ranges):
(date:{20010810 TO 20041011}) OR (+date:20010810 +hour:[11 TO ]) OR
(+date:20041011 +hour:{ TO 13})

If you need even more fine-grained ranges, you can extend this idea by
adding more fields (at the cost of making the generated query even more
complex)

You can already add the separate fields to your documents even if you
don't use them yet...
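
As a rough sketch of the splitting idea, the following plain-Java helper
builds the query text for a date-plus-hour range. Everything here is an
assumption made for illustration: the class DateHourRange, the field names
"date" (yyyyMMdd) and "hour" (two digits), and the choice to treat both
the start hour and the end hour as inclusive so that no open-ended bounds
are needed. It only produces a string; it does not call Lucene.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

class DateHourRange {
    // BASIC_ISO_DATE formats a date as yyyyMMdd, e.g. 20010810.
    private static final DateTimeFormatter DAY = DateTimeFormatter.BASIC_ISO_DATE;

    /** Zero-pads an hour to two digits so that string order equals numeric order. */
    private static String two(int hour) {
        return (hour < 10 ? "0" : "") + hour;
    }

    /** Builds query text covering start hour through end hour, both inclusive. */
    static String toQuery(LocalDateTime start, LocalDateTime end) {
        String firstDay = start.toLocalDate().format(DAY);
        String lastDay = end.toLocalDate().format(DAY);
        String fromHour = two(start.getHour());
        String toHour = two(end.getHour());

        // Special case: start and end fall on the same date.
        if (firstDay.equals(lastDay)) {
            return "+date:" + firstDay + " +hour:[" + fromHour + " TO " + toHour + "]";
        }
        // Whole days strictly between the two dates (exclusive range),
        // plus the tail of the first day and the head of the last day.
        return "date:{" + firstDay + " TO " + lastDay + "}"
            + " OR (+date:" + firstDay + " +hour:[" + fromHour + " TO 23])"
            + " OR (+date:" + lastDay + " +hour:[00 TO " + toHour + "])";
    }
}
```

With this layout the date part expands to at most one clause per distinct
date and each hour part to at most 24 clauses. When the two dates are
adjacent, the exclusive date range simply matches nothing.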

Regards,

Luc


 -Original Message-
 From: Terry Steichen [mailto:[EMAIL PROTECTED] 
 Sent: dinsdag 26 oktober 2004 18:28
 To: Lucene Users List
 Subject: Re: BooleanQuery - TooManyClauses 
 
 I think what Erik's asking is whether you can live with 
 expressing your indexed date in the form of YYYYMMDD, without 
 the hour and minute extension.  That will sharply reduce the 
 number of range query expansion terms.  If you're using the 
 timestamp as a unique identifier, you might consider creating 
 two fields, one for the unique identifier (YYYYMMDDHHmmssZ) 
 and one for the date (YYYYMMDD), and only use the range on 
 the date field (not on the timestamp field).
 
 Regards,
 
 Terry
   - Original Message -
   From: Angelov, Rossen
   To: 'Lucene Users List' 
   Sent: Tuesday, October 26, 2004 11:43 AM
   Subject: RE: BooleanQuery - TooManyClauses 
 
 
   
   On Oct 25, 2004, at 6:35 PM, Angelov, Rossen wrote:
Why is there a limit on the number of clauses? And is there any harm in
setting MaxClauseCount to Integer.MAX_VALUE?
   
   The harm is in performance and resource utilization.  Rather than do
   this, though, read on...
   
I'm using a Range Query on a field that represents dates and getting
a BooleanQuery$TooManyClauses exception.
This is the query -  +/article/createddateiso8601:[2003010100 TO
2003123199]
   
   Do you really need to do ranges down to that time level?  Or are you
   really just concerned with date?  If you indexed using YYYYMMDD
   instead, there would only be a maximum of 365 terms in that range,
   whereas you've got zillions (ok, I was too lazy to do the math!  But
   far more than 1,024).
 
   I need to do range searches. They are part of the requirements and, even
   worse, the range can be as big as 10 years for now. It will get
   bigger. I'm indexing using the YYYYMMDDHHmmssZ format and, as you said,
   there will be more than just 365 terms per year. This number changes every
   day as new documents are indexed daily. The only limit I can see is the
   number of documents that were indexed. I guess maxClauseCount can't be
   more than the number of indexed documents.
 
   I recommend changing how you index dates, or at least use a different
   field for queries that do not need to concern themselves with the
   timestamp aspect.
 
   What do you mean by changing how the dates are indexed? By the way, this
   field is indexed as a string.
 
   
Erik
   
   
 
   Ross
 
 
 
