Re: range and content query

2004-09-20 Thread Morus Walter
Chris Fraschetti writes:
 can someone assist me in building, or deny the possibility of combining, a
 range query and a standard query?
 
 say for instance i have two fields i'm searching on... one being a
 field with an epoch date associated with the entry, and the
 content... so how can I make a query to select a range of those
 epochs, as well as search through the content? can it be done in one
 query, or do I have to perform a query upon a query, and if so, what
 might the syntax look like?
 
If you create the query using the API, use a BooleanQuery to combine
the two basic queries.

If you use the query parser, use AND or OR.

Note that range queries are expanded into boolean queries (OR combined)
which may be a problem if the number of terms matching the range is too
big. Depends on your date entries and especially how precise they are.
Alternatively you might consider using a filter.
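For the query-parser route, the combined query is just one string. Here is a minimal plain-Java sketch of building such a string in Lucene query-parser syntax; the field names content and date and the padded values are illustrative assumptions, not the original poster's schema:

```java
public class CombinedQuery {
    // Build a query string that requires both a content term and a
    // date range, in Lucene query-parser syntax. Field names are
    // illustrative assumptions.
    static String build(String term, String from, String to) {
        return "+content:" + term + " +date:[" + from + " TO " + to + "]";
    }

    public static void main(String[] args) {
        System.out.println(build("apache", "0001", "0005"));
    }
}
```

Parsing such a string with QueryParser should yield the same kind of BooleanQuery one would build directly through the API.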

HTH
Morus

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: indexes won't close on windows

2004-09-20 Thread sergiu gordea
 Hi Fred,
I think that we can help you if you provide us your code and the
context in which it is used.
We need to see how you open and close the searcher and the reader, and
what operations you are performing on the index.

 All the best,
 Sergiu

Fred Toth wrote:
Hi,
I have built a nice lucene application on linux with no problems,
but when I ported to windows for the customer, I started experiencing
problems with the index not closing. This prevents re-indexing.
I'm using lucene 1.4.1 under tomcat 5.0.28.
My search operation is very simple and works great:
create reader
create searcher
do search
extract N docs from hits
close searcher
close reader
However, on several occasions, when trying to re-index, I get
"can't delete file" errors from the indexer. I discovered that restarting
tomcat clears the problem. (Note that I'm recreating the index
completely, not updating.)
I've spent the last couple of hours trolling the archives and I've
found numerous references to windows problems with open files.
Is there a fix for this? How can I force the files to close? What's
the best work-around?
Many thanks,
Fred


Re: range and content query

2004-09-20 Thread Chris Fraschetti
I've more or less figured out the query string required to get a range
of docs, say date:[0 TO 10], assuming my dates are from 1 to 10 (for
the sake of this example), but my query has results that I don't
understand: if I do 0 TO 10, then I only get results matching
0, 1, 10; if I do 0 TO 8, I get all results from 0 to 10; and if
I do 1 TO 5, then I get results 1, 2, 3, 4, 5, 10. Very strange.

here is how my query looks...
query: +date_field:[1 TO 5]

here is how the date was added...
Document doc = new Document();
doc.add(Field.UnIndexed("arcpath_field", filename));
doc.add(Field.Keyword("date_field", date));
doc.add(Field.Text("content_field", content));
writer.addDocument(doc);

I tried Field.Text for the date and also received the same results.
Essentially I have a loop to add 11 strings, indexes 0 to 10, and
add "doc0", "0", "some text" for each, and the results I get are as
explained above... any ideas?

Here is my simple searching code.. i'm currently not searching for any
text... i just want to test the range feature right now

query_string = "+(" + DATE_FIELD + ":[" + start_date + " TO " + end_date + "])";
Searcher searcher = new IndexSearcher(index_path);
QueryParser parser = new QueryParser(CONTENT_FIELD, new StandardAnalyzer());
parser.setOperator(QueryParser.DEFAULT_OPERATOR_OR);
Query query = parser.parse(query_string);
System.out.println("query: " + query.toString());
Hits hits = searcher.search(query);


On Mon, 20 Sep 2004 08:24:17 +0200, Morus Walter [EMAIL PROTECTED] wrote:
 Chris Fraschetti writes:
  can someone assist me in building, or deny the possibility of combining, a
  range query and a standard query?
 
  say for instance i have two fields i'm searching on... one being a
  field with an epoch date associated with the entry, and the
  content... so how can I make a query to select a range of those
  epochs, as well as search through the content? can it be done in one
  query, or do I have to perform a query upon a query, and if so, what
  might the syntax look like?
  
 If you create the query using the API, use a BooleanQuery to combine
 the two basic queries.
 
 If you use the query parser, use AND or OR.
 
 Note that range queries are expanded into boolean queries (OR combined)
 which may be a problem if the number of terms matching the range is too
 big. Depends on your date entries and especially how precise they are.
 Alternatively you might consider using a filter.
 
 HTH
Morus
 



-- 
___
Chris Fraschetti, Student CompSci System Admin
University of San Francisco
e [EMAIL PROTECTED] | http://meteora.cs.usfca.edu




Re: range and content query

2004-09-20 Thread Morus Walter
Chris Fraschetti writes:
 I've more or less figured out the query string required to get a range
 of docs.. say date[0 TO 10]assuming my dates are from 1 to 10 (for
 the sake of this example) ... my query has results that I don't
 understand. if i do from 0 TO 10, then I only get results matching
 0,1,10  ... if i do 0 TO 8, i get all results ... from 0 to 10...   if
 i do   1 TO 5  ... then i get results 1,2,3,4,5,10  ... very strange.
 
that's not strange. Lucene indexes strings and compares strings. Not numbers.
So the order is
1
10
101
11
2
20
21
3
4
and so on

It's up to you to format your numbers so that they sort correctly as
strings, e.g. use leading '0's to get
001
002
003
004
010
011
020
021
...

I think there's a page in the wiki about these issues.
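The padding fix can be sketched in plain Java; the three-digit width used here is an assumption and must be wide enough to cover the largest number you index:

```java
import java.util.Arrays;

public class PadDemo {
    // Left-pad a non-negative number with zeros so that lexicographic
    // (string) order matches numeric order. Width 3 is an assumption.
    static String pad(int n) {
        StringBuffer b = new StringBuffer(Integer.toString(n));
        while (b.length() < 3) {
            b.insert(0, '0');
        }
        return b.toString();
    }

    public static void main(String[] args) {
        String[] raw = { "2", "10", "1", "11", "101" };
        Arrays.sort(raw);
        // string order: 1, 10, 101, 11, 2 -- the order Lucene sees
        System.out.println(Arrays.asList(raw));

        String[] padded = { pad(2), pad(10), pad(1), pad(11), pad(101) };
        Arrays.sort(padded);
        // padded order: 001, 002, 010, 011, 101 -- matches numeric order
        System.out.println(Arrays.asList(padded));
    }
}
```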

 here is how my query looks...
 query: +date_field:[1 TO 5]
 
 here is how the date was added...
 Document doc = new Document();
 doc.add(Field.UnIndexed("arcpath_field", filename));
 doc.add(Field.Keyword("date_field", date));
 doc.add(Field.Text("content_field", content));
 writer.addDocument(doc);
 
 I tried Field.Text for the date and also received the same results.
 Essentially I have a loop to add 11 strings... indexes 0 to 10... and
 add doc0, 0, some text  for each..  and the results i get as as
 explained above... any ideas?
 
 Here is my simple searching code.. i'm currently not searching for any
 text... i just want to test the range feature right now
 
 query_string = "+(" + DATE_FIELD + ":[" + start_date + " TO " + end_date + "])";
 Searcher searcher = new IndexSearcher(index_path);
 QueryParser parser = new QueryParser(CONTENT_FIELD, new StandardAnalyzer());
 parser.setOperator(QueryParser.DEFAULT_OPERATOR_OR);
 Query query = parser.parse(query_string);
 System.out.println("query: " + query.toString());
 Hits hits = searcher.search(query);
 

It's bad practice to create search strings that have to be decomposed
by the query parser again if you have the parts already at hand,
at least in most cases.
I don't know the details of how and when the query parser calls the analyzer
and what StandardAnalyzer does with numbers.
What does query.toString() output?

But the main problem seems to be your misunderstanding of searching numbers
in Lucene. They are just strings and are compared by their lexical
representation, not their numeric value.

Morus




Re: range and content query

2004-09-20 Thread Chris Fraschetti
Very correct you are. Changing the format of the numbers when I index
them and when I build the range fixed my problem. Thanks much.


On Mon, 20 Sep 2004 09:08:50 +0200, Morus Walter [EMAIL PROTECTED] wrote:
 Chris Fraschetti writes:
  I've more or less figured out the query string required to get a range
  of docs.. say date[0 TO 10]assuming my dates are from 1 to 10 (for
  the sake of this example) ... my query has results that I don't
  understand. if i do from 0 TO 10, then I only get results matching
  0,1,10  ... if i do 0 TO 8, i get all results ... from 0 to 10...   if
  i do   1 TO 5  ... then i get results 1,2,3,4,5,10  ... very strange.
 
 that's not strange. Lucene indexes strings and compares strings. Not numbers.
 So the order is
 1
 10
 101
 11
 2
 20
 21
 3
 4
 and so on
 
 It's up to you to format your numbers so that they sort correctly as
 strings, e.g. use leading '0's to get
 001
 002
 003
 004
 010
 011
 020
 021
 ...
 
 I think there's a page in the wiki about these issues.
 
  here is how my query looks...
  query: +date_field:[1 TO 5]
 
  here is how the date was added...
  Document doc = new Document();
  doc.add(Field.UnIndexed("arcpath_field", filename));
  doc.add(Field.Keyword("date_field", date));
  doc.add(Field.Text("content_field", content));
  writer.addDocument(doc);
 
  I tried Field.Text for the date and also received the same results.
  Essentially I have a loop to add 11 strings... indexes 0 to 10... and
  add doc0, 0, some text  for each..  and the results i get as as
  explained above... any ideas?
 
  Here is my simple searching code.. i'm currently not searching for any
  text... i just want to test the range feature right now
 
  query_string = "+(" + DATE_FIELD + ":[" + start_date + " TO " + end_date + "])";
  Searcher searcher = new IndexSearcher(index_path);
  QueryParser parser = new QueryParser(CONTENT_FIELD, new StandardAnalyzer());
  parser.setOperator(QueryParser.DEFAULT_OPERATOR_OR);
  Query query = parser.parse(query_string);
  System.out.println("query: " + query.toString());
  Hits hits = searcher.search(query);
 
 
 It's bad practice to create search strings that have to be decomposed
 by the query parser again if you have the parts already at hand,
 at least in most cases.
 I don't know the details of how and when the query parser calls the analyzer
 and what StandardAnalyzer does with numbers.
 What does query.toString() output?
 
 But the main problem seems to be your misunderstanding of searching numbers
 in lucene. They are just strings and are treated by their lexical
 representation not their numeric value.
 
 Morus
 



-- 
___
Chris Fraschetti, Student CompSci System Admin
University of San Francisco
e [EMAIL PROTECTED] | http://meteora.cs.usfca.edu




Re: indexes won't close on windows

2004-09-20 Thread Fred Toth
Hi Sergiu,
My searches take place in tomcat, in a struts action, in a single method
Abbreviated code:
IndexReader reader = null;
IndexSearcher searcher = null;
reader = IndexReader.open(indexName);
searcher = new IndexSearcher(reader);
// code to do a search and extract hits, works fine.
searcher.close();
reader.close();
I have a command-line indexer that is a minor modification of the
IndexHTML.java that comes with Lucene. It does this:
writer = new IndexWriter(index, new StandardAnalyzer(), create);
// add docs
(with the create flag set true). It is here that I get a failure, "can't
delete _b9.cfs" or similar. This happens when tomcat is completely idle
(we're still testing and not live), so all readers and searchers should be
closed, at least as far as Java is concerned. But Windows will not allow
the indexer to delete the old index.

I restarted tomcat and the problem cleared. It's as if the JVM on Windows
doesn't get the file closes quite right.

I've seen numerous references on this list to similar behavior, but it's
not clear what the fix might be.

Many thanks,
Fred
At 02:32 AM 9/20/2004, you wrote:
 Hi Fred,
I think that we can help you if you provide us your code, and the context 
in which it is used.
we need to see how you open and close the searcher and the reader, and 
what operations are you doing on index.

 All the best,
 Sergiu

Fred Toth wrote:
Hi,
I have built a nice lucene application on linux with no problems,
but when I ported to windows for the customer, I started experiencing
problems with the index not closing. This prevents re-indexing.
I'm using lucene 1.4.1 under tomcat 5.0.28.
My search operation is very simple and works great:
create reader
create searcher
do search
extract N docs from hits
close searcher
close reader
However, on several occasions, when trying to re-index, I get
can't delete file errors from the indexer. I discovered that restarting
tomcat clears the problem. (Note that I'm recreating the index
completely, not updating.)
I've spent the last couple of hours trolling the archives and I've
found numerous references to windows problems with open files.
Is there a fix for this? How can I force the files to close? What's
the best work-around?
Many thanks,
Fred


Re: indexes won't close on windows

2004-09-20 Thread Otis Gospodnetic
Fred,

I won't get into the details here, but you shouldn't (have to) open a
new IndexReader/Searcher on each request (I'll assume the code below is
from your Action's execute method).  You should cache and re-use
IndexReaders (and IndexSearchers).  There may be a FAQ entry regarding
that, I'm not sure.  Closing them on every request is also something
you shouldn't do (opening and closing them is, in simple terms, just
doing too much work: open N files, read them, close them; open N
files, read them, close them; and so on).
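The cache-and-reuse advice can be sketched as a synchronized lazy holder. To keep the sketch self-contained and free of the Lucene jar, Object stands in for IndexSearcher and openSearcher() is a hypothetical factory:

```java
public class SearcherHolder {
    // Shared instance; in a real application this would be an
    // org.apache.lucene.search.IndexSearcher.
    private static Object searcher;

    // Hypothetical factory standing in for "new IndexSearcher(path)".
    private static Object openSearcher() {
        return new Object();
    }

    // Open on first use, then hand the same instance to every request
    // instead of opening and closing per search.
    public static synchronized Object get() {
        if (searcher == null) {
            searcher = openSearcher();
        }
        return searcher;
    }
}
```

When the index is rebuilt, the holder would be reset (and the old searcher closed) under the same lock.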

Regarding failing deletion, that's a Windows OS thing - it won't let
you remove a file while another process has it open.  I am not certain
where exactly this error comes from in Lucene (exception stack trace?),
but I thought the Lucene code included work-arounds for this.

Otis

--- Fred Toth [EMAIL PROTECTED] wrote:

 Hi Sergiu,
 
 My searches take place in tomcat, in a struts action, in a single
 method
 Abbreviated code:
 
  IndexReader reader = null;
  IndexSearcher searcher = null;
  reader = IndexReader.open(indexName);
searcher = new IndexSearcher(reader);
  // code to do a search and extract hits, works fine.
  searcher.close();
reader.close();
 
 I have a command-line indexer that is a minor modification of the
 IndexHTML.java that comes with Lucene. It does this:
 
  writer = new IndexWriter(index, new StandardAnalyzer(),
 create);
  // add docs
 
 (with the create flag set true). It is here that I get a failure,
 can't 
 delete _b9.cfs
 or similar. This happens when tomcat is completely idle (we're still 
 testing and
 not live), so all readers and searchers should be closed, as least as
 far as
 java is concerned. But windows will not allow the indexer to delete
 the old 
 index.
 
 I restarted tomcat and the problem cleared. It's as if the JVM on
 windows 
 doesn't
 get the file closes quite right.
 
 I've seen numerous references on this list to similar behavior, but
 it's 
 not clear
 what the fix might be.
 
 Many thanks,
 
 Fred
 
 At 02:32 AM 9/20/2004, you wrote:
   Hi Fred,
 
 I think that we can help you if you provide us your code, and the
 context 
 in which it is used.
 we need to see how you open and close the searcher and the reader,
 and 
 what operations are you doing on index.
 
   All the best,
 
   Sergiu
 
 
 
 Fred Toth wrote:
 
 Hi,
 
 I have built a nice lucene application on linux with no problems,
 but when I ported to windows for the customer, I started
 experiencing
 problems with the index not closing. This prevents re-indexing.
 
 I'm using lucene 1.4.1 under tomcat 5.0.28.
 
 My search operation is very simple and works great:
 
 create reader
 create searcher
 do search
 extract N docs from hits
 close searcher
 close reader
 
 However, on several occasions, when trying to re-index, I get
 can't delete file errors from the indexer. I discovered that
 restarting
 tomcat clears the problem. (Note that I'm recreating the index
 completely, not updating.)
 
 I've spent the last couple of hours trolling the archives and I've
 found numerous references to windows problems with open files.
 
 Is there a fix for this? How can I force the files to close? What's
 the best work-around?
 
 Many thanks,
 
 Fred
 
 




Re: indexes won't close on windows

2004-09-20 Thread sergiu gordea
Hi Fred,
That's right, there are many references to this kind of problem in the
lucene-user list.
These suggestions were already made, but I'll list them once again:

1. One way to use the IndexSearcher is to use your code, but I don't
encourage users to do that:
   IndexReader reader = null;
   IndexSearcher searcher = null;
   reader = IndexReader.open(indexName);
   searcher = new IndexSearcher(reader);

   It's better to use the constructor that takes a String path,
IndexSearcher(String path). I even suggest that the path be obtained as:

File indexFolder = new File(luceneIndex);
IndexSearcher searcher = new IndexSearcher(indexFolder.toString());
2. I can imagine situations when the lucene index must be created at
each startup, but I think that this is very rare,
so I suggest using code like:

if (indexExists(indexFolder))
   writer = new IndexWriter(index, new StandardAnalyzer(), false);
else
   writer = new IndexWriter(index, new StandardAnalyzer(), true);
// don't forget to close the IndexWriter when you create the index and to
// open it again

I use an indexExists function like

boolean indexExists(File indexFolder) {
   return indexFolder.exists();
}

and it works properly ... even if that's not the best example of
testing the existence of the index.

3. 'It is here that I get a failure, can't delete _b9.cfs'
That's probably because of the way you use the searcher, and probably
because you don't close the readers, writers and searchers properly.
4. Be sure that all close() methods are guarded with
   catch (Exception e) {
       logger.log(e);
   } blocks.

5. Pay attention if you use a multithreading environment; in this case
you have to make indexing, deletion and search synchronized.

  So ...
 Have fun,
   Sergiu
PS: I think that I'll submit some code with synchronized
index/delete/search operations and explain why I need to use it.

Fred Toth wrote:
Hi Sergiu,
My searches take place in tomcat, in a struts action, in a single method
Abbreviated code:
IndexReader reader = null;
IndexSearcher searcher = null;
reader = IndexReader.open(indexName);
  searcher = new IndexSearcher(reader);
// code to do a search and extract hits, works fine.
searcher.close();
  reader.close();
I have a command-line indexer that is a minor modification of the
IndexHTML.java that comes with Lucene. It does this:
writer = new IndexWriter(index, new StandardAnalyzer(), create);
// add docs
(with the create flag set true). It is here that I get a failure,
"can't delete _b9.cfs" or similar. This happens when tomcat is completely
idle (we're still testing and not live), so all readers and searchers
should be closed, at least as far as Java is concerned. But Windows will
not allow the indexer to delete the old index.

I restarted tomcat and the problem cleared. It's as if the JVM on 
windows doesn't
get the file closes quite right.

I've seen numerous references on this list to similar behavior, but 
it's not clear
what the fix might be.

Many thanks,
Fred
At 02:32 AM 9/20/2004, you wrote:
 Hi Fred,
I think that we can help you if you provide us your code, and the 
context in which it is used.
we need to see how you open and close the searcher and the reader, 
and what operations are you doing on index.

 All the best,
 Sergiu

Fred Toth wrote:
Hi,
I have built a nice lucene application on linux with no problems,
but when I ported to windows for the customer, I started experiencing
problems with the index not closing. This prevents re-indexing.
I'm using lucene 1.4.1 under tomcat 5.0.28.
My search operation is very simple and works great:
create reader
create searcher
do search
extract N docs from hits
close searcher
close reader
However, on several occasions, when trying to re-index, I get
can't delete file errors from the indexer. I discovered that 
restarting
tomcat clears the problem. (Note that I'm recreating the index
completely, not updating.)

I've spent the last couple of hours trolling the archives and I've
found numerous references to windows problems with open files.
Is there a fix for this? How can I force the files to close? What's
the best work-around?
Many thanks,
Fred


Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-20 Thread Morus Walter
David Spencer writes:
  
  could you put the current version of your code on that website as a java
 
 Weblog entry updated:
 
 http://searchmorph.com/weblog/index.php?id=23
 
thanks
 
 Great suggestion and thanks for that idiom - I should know such things
 by now. To clarify the issue, it's just a performance one, not other
 functionality... anyway I put in the code - and to be scientific I
 benchmarked it two times before the change and two times after - and the
 results were surprisingly the same both times (1:45 to 1:50 with an index
 that takes up 200MB). Probably there are cases where this will run
 faster, and the code seems more correct now so it's in.
 
Ahh, I see, you check the field later.
The logging made me think you index all fields you loop over, in which
case one might get unwanted words into the ngram index.
 
 
  
  
  An interesting application of this might be an ngram-Index enhanced version
  of the FuzzyQuery. While this introduces more complexity on the indexing
  side, it might be a large speedup for fuzzy searches.
 
 I was also thinking of reviewing the list to see if anyone had done a
 Jaro-Winkler fuzzy query yet, and doing that.
 
I went in another direction, and changed the ngram index and search
to use a similarity that computes

   m * m / (n1 * n2)

where m is the number of matches, n1 is the number of ngrams in the
query, and n2 is the number of ngrams in the word.
(At least if I got that right; I'm not sure if I understand all parts
of the similarity class correctly.)
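As a plain-Java illustration of that score (this is not the attached Similarity subclass, just the formula applied to bigram sets):

```java
import java.util.HashSet;
import java.util.Set;

public class NGramScore {
    // Collect the distinct 2-grams of a word.
    static Set bigrams(String word) {
        Set grams = new HashSet();
        for (int i = 0; i + 2 <= word.length(); i++) {
            grams.add(word.substring(i, i + 2));
        }
        return grams;
    }

    // m * m / (n1 * n2): m = matching ngrams, n1 = ngrams in the
    // query, n2 = ngrams in the word.
    static double score(String query, String word) {
        Set q = bigrams(query);
        Set w = bigrams(word);
        int n1 = q.size();
        int n2 = w.size();
        q.retainAll(w);       // q now holds only the matches
        int m = q.size();
        return (double) (m * m) / (n1 * n2);
    }

    public static void main(String[] args) {
        System.out.println(score("lucene", "lucene")); // identical words: 1.0
        System.out.println(score("lucene", "lucent"));
    }
}
```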

After removing the document boost in the ngram index based on the
word frequency in the original index, I find the results pretty good.
My data is a number of encyclopedias and dictionaries and I only use the
headwords for the ngram index. Term frequency doesn't seem relevant
in this case.

I still use the Levenshtein distance to modify the score and sort according
to score / distance, but in most cases this does not make a difference.
So I'll probably drop the distance calculation completely.

I also see little difference between using 2- and 3-grams on the one hand
and only using 2-grams on the other. So I'll presumably drop the 3-grams.

I'm not sure if the similarity I use is useful in general, but I
attached it to this message in case someone is interested.
Note that you need to set the similarity for the index writer and searcher,
and thus have to reindex in case you want to give it a try.

Morus



Re: indexes won't close on windows

2004-09-20 Thread Fred Toth
Hi Otis,
I understand about reusing readers and searchers, but I was
working on the "do the simplest thing that can possibly work" theory
for starters, in part because I wanted to be sure that I could recreate
the index safely as needed.
I should emphasize that I developed for weeks on linux without ever
seeing this problem, but in less than 24 hours after installing on the
customer's windows box, I hit the error.
So is a close() not really a close()? Is Lucene actually hanging on to
open files? Or is this a JVM on windows bug? (I'm using the latest 1.4.2
from Sun.)
As I mentioned, this has turned up off and on in the mail archives. Is
there no well-understood fix or work-around? I'll get a stack trace
set up for the next time it happens.
Thanks,
Fred
At 08:35 AM 9/20/2004, you wrote:
Fred,
I won't get into the details here, but you shouldn't (have to) open a
new IndexReader/Searcher on each request (I'll assume the code below is
from your Action's execute method).  You should cache and re-use
IndexReaders (and IndexSearchers).  There may be a FAQ entry regarding
that, I'm not sure.  Closing them on every request is also something
you shouldn't do (opening and closing them is, in simple terms, just
doing too much work.  Open N files, read them, close them.  Open N
files, read them, close them.  And so on)
Regarding failing deletion, that's a Windows OS thing - it won't let
you remove a file while another process has it open.  I am not certain
where exactly this error comes from in Lucene (exception stack trace?),
but I thought the Lucene code included work-arounds for this.
Otis
--- Fred Toth [EMAIL PROTECTED] wrote:
 Hi Sergiu,

 My searches take place in tomcat, in a struts action, in a single
 method
 Abbreviated code:

  IndexReader reader = null;
  IndexSearcher searcher = null;
  reader = IndexReader.open(indexName);
searcher = new IndexSearcher(reader);
  // code to do a search and extract hits, works fine.
  searcher.close();
reader.close();

 I have a command-line indexer that is a minor modification of the
 IndexHTML.java that comes with Lucene. It does this:

  writer = new IndexWriter(index, new StandardAnalyzer(),
 create);
  // add docs

 (with the create flag set true). It is here that I get a failure,
 can't
 delete _b9.cfs
 or similar. This happens when tomcat is completely idle (we're still
 testing and
 not live), so all readers and searchers should be closed, as least as
 far as
 java is concerned. But windows will not allow the indexer to delete
 the old
 index.

 I restarted tomcat and the problem cleared. It's as if the JVM on
 windows
 doesn't
 get the file closes quite right.

 I've seen numerous references on this list to similar behavior, but
 it's
 not clear
 what the fix might be.

 Many thanks,

 Fred

 At 02:32 AM 9/20/2004, you wrote:
   Hi Fred,
 
 I think that we can help you if you provide us your code, and the
 context
 in which it is used.
 we need to see how you open and close the searcher and the reader,
 and
 what operations are you doing on index.
 
   All the best,
 
   Sergiu
 
 
 
 Fred Toth wrote:
 
 Hi,
 
 I have built a nice lucene application on linux with no problems,
 but when I ported to windows for the customer, I started
 experiencing
 problems with the index not closing. This prevents re-indexing.
 
 I'm using lucene 1.4.1 under tomcat 5.0.28.
 
 My search operation is very simple and works great:
 
 create reader
 create searcher
 do search
 extract N docs from hits
 close searcher
 close reader
 
 However, on several occasions, when trying to re-index, I get
 can't delete file errors from the indexer. I discovered that
 restarting
 tomcat clears the problem. (Note that I'm recreating the index
 completely, not updating.)
 
 I've spent the last couple of hours trolling the archives and I've
 found numerous references to windows problems with open files.
 
 Is there a fix for this? How can I force the files to close? What's
 the best work-around?
 
 Many thanks,
 
 Fred
 
 



Re: indexes won't close on windows

2004-09-20 Thread Fred Toth
Hi Sergiu,
Thanks for your suggestions. I will try using just the IndexSearcher(String...)
and see if that makes a difference in the problem. I can confirm that
I am doing a proper close() and that I'm checking for exceptions. Again,
the problem is not with the search function, but with the command-line
indexer. It is not run at startup, but on demand when the index needs
to be recreated.
Thanks,
Fred
At 08:50 AM 9/20/2004, you wrote:
Hi Fred,
That's right, there are many references to this kind of problem in the
lucene-user list.
These suggestions were already made, but I'll list them once again:

1. One way to use the IndexSearcher is to use your code, but I don't
encourage users to do that:
   IndexReader reader = null;
   IndexSearcher searcher = null;
   reader = IndexReader.open(indexName);
   searcher = new IndexSearcher(reader);

   It's better to use the constructor that takes a String path,
IndexSearcher(String path). I even suggest that the path be obtained as:

File indexFolder = new File(luceneIndex);
IndexSearcher searcher = new IndexSearcher(indexFolder.toString());
2. I can imagine situations when the lucene index must be created at each
startup, but I think that this is very rare,
so I suggest using code like:

if (indexExists(indexFolder))
   writer = new IndexWriter(index, new StandardAnalyzer(), false);
else
   writer = new IndexWriter(index, new StandardAnalyzer(), true);
// don't forget to close the IndexWriter when you create the index and to
// open it again

I use an indexExists function like

boolean indexExists(File indexFolder) {
   return indexFolder.exists();
}

and it works properly ... even if that's not the best example of testing
the existence of the index.

3. 'It is here that I get a failure, can't delete _b9.cfs'
That's probably because of the way you use the searcher, and probably
because you don't close the readers, writers and searchers properly.
4. Be sure that all close() methods are guarded with
   catch (Exception e) {
       logger.log(e);
   } blocks.

5. Pay attention if you use a multithreading environment; in this case you
have to make indexing, deletion and search synchronized.

  So ...
 Have fun,
   Sergiu
PS: I think that I'll submit some code with synchronized
index/delete/search operations and explain why I need to use it.

Fred Toth wrote:
Hi Sergiu,
My searches take place in tomcat, in a struts action, in a single method
Abbreviated code:
IndexReader reader = null;
IndexSearcher searcher = null;
reader = IndexReader.open(indexName);
  searcher = new IndexSearcher(reader);
// code to do a search and extract hits, works fine.
searcher.close();
  reader.close();
I have a command-line indexer that is a minor modification of the
IndexHTML.java that comes with Lucene. It does this:
writer = new IndexWriter(index, new StandardAnalyzer(), create);
// add docs
(with the create flag set true). It is here that I get a failure, can't 
delete _b9.cfs
or similar. This happens when tomcat is completely idle (we're still 
testing and
not live), so all readers and searchers should be closed, at least as far as
java is concerned. But windows will not allow the indexer to delete the 
old index.

I restarted tomcat and the problem cleared. It's as if the JVM on windows 
doesn't
get the file closes quite right.

I've seen numerous references on this list to similar behavior, but it's 
not clear
what the fix might be.

Many thanks,
Fred
At 02:32 AM 9/20/2004, you wrote:
 Hi Fred,
I think that we can help you if you provide us your code, and the 
context in which it is used.
we need to see how you open and close the searcher and the reader, and 
what operations are you doing on index.

 All the best,
 Sergiu

Fred Toth wrote:
Hi,
I have built a nice lucene application on linux with no problems,
but when I ported to windows for the customer, I started experiencing
problems with the index not closing. This prevents re-indexing.
I'm using lucene 1.4.1 under tomcat 5.0.28.
My search operation is very simple and works great:
create reader
create searcher
do search
extract N docs from hits
close searcher
close reader
However, on several occasions, when trying to re-index, I get
can't delete file errors from the indexer. I discovered that restarting
tomcat clears the problem. (Note that I'm recreating the index
completely, not updating.)
I've spent the last couple of hours trolling the archives and I've
found numerous references to windows problems with open files.
Is there a fix for this? How can I force the files to close? What's
the best work-around?
Many thanks,
Fred
-
To unsubscribe, e-mail: [EMAIL PROTECTED]

Re[2]: indexes won't close on windows

2004-09-20 Thread Maxim Patramanskij
Hello Fred,

When you recreate an index from scratch (with the last
IndexWriter constructor argument true), all IndexReaders must be
closed, because IndexWriter tries to delete all files in the directory
where your index is being created.

If you have any opened IndexReader at this time, then Windows
locks some of the files used by the IndexReader, preventing them from being
changed by another process. Thus IndexWriter's constructor is unable
to delete these files and throws an IOException.
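The mechanism is easy to demonstrate with plain java.io (a hedged sketch, not Lucene code: on Windows, a delete() attempted while another stream still holds the file open fails, which is exactly the can't-delete symptom; closing first makes the delete safe on any platform):

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public class CloseBeforeDelete {

    // Write to a file, close the stream, then delete. Windows refuses
    // File.delete() while any stream (e.g. an open IndexReader's file
    // handle) still references the file, so the close() must come first.
    public static boolean writeCloseDelete(File f) throws IOException {
        FileOutputStream out = new FileOutputStream(f);
        try {
            out.write("segment data".getBytes());
        } finally {
            out.close();   // release the handle ...
        }
        return f.delete(); // ... then deletion succeeds
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("_b9", ".cfs");
        System.out.println(writeCloseDelete(f)); // prints "true"
    }
}
```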


Max


Monday, September 20, 2004, 4:40:00 PM, you wrote:

FT Hi Sergiu,

FT Thanks for your suggestions. I will try using just the IndexSearcher(String...)
FT and see if that makes a difference in the problem. I can confirm that
FT I am doing a proper close() and that I'm checking for exceptions. Again,
FT the problem is not with the search function, but with the command-line
FT indexer. It is not run at startup, but on demand when the index needs
FT to be recreated.

FT Thanks,

FT Fred


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: indexes won't close on windows

2004-09-20 Thread sergiu gordea
Fred Toth wrote:
Hi Sergiu,
Thanks for your suggestions. I will try using just the 
IndexSearcher(String...)
and see if that makes a difference in the problem. I can confirm that
I am doing a proper close() and that I'm checking for exceptions. Again,
the problem is not with the search function, but with the command-line
indexer. It is not run at startup, but on demand when the index needs
to be recreated.

Thanks,
Fred
I remember there was one case where the searcher was used in the way you 
use it, but without keeping the named reference to the index reader. This 
is not your case.

Why do you get 'It is here that I get a failure, can't delete _b9.cfs'?
Are you trying to delete the index folder sometimes, or... why?
Maybe one object is still using the index when you try to delete it.
Do you write your errors in log files?
It would be very helpful to have a stack trace.
All the best,
Sergiu


At 08:50 AM 9/20/2004, you wrote:
Hi Fred,
That's right, there are many references to this kind of problem on 
the lucene-user list.
These suggestions were already made, but I'll list them once again:

1. One way to use the IndexSearcher is to use your code, but I don't 
encourage users to do that:
   IndexReader reader = null;
   IndexSearcher searcher = null;
   reader = IndexReader.open(indexName);
 searcher = new IndexSearcher(reader);

   It's better to use the constructor that takes a String to create an 
IndexSearcher: IndexSearcher(String path). I even suggest that the path 
be obtained as

File indexFolder = new File(luceneIndex);
IndexSearcher searcher = new IndexSearcher(indexFolder.toString());

RE: indexes won't close on windows

2004-09-20 Thread Jiří Kuhn
Hi,

I guess you have answered yourself. I can imagine that Tomcat was serving your 
servlet with a constructed IndexSearcher while your command-line application 
wanted to recreate the index. Are you protected against this situation?

Jiri.


Re: Running OutOfMemory while optimizing and searching

2004-09-20 Thread John Z
Doug
 
Thank you for confirming this.
 
ZJ

Doug Cutting [EMAIL PROTECTED] wrote:
John Z wrote:
 We have indexes of around 1 million docs and around 25 searchable fields.
 We noticed that without any searches performed on the indexes, on startup, the 
 memory taken up by the searcher is roughly 7 times the .tii file size. The .tii file 
 is read into memory as per the code. Our .tii files are around 8-10 MB in size and 
 our startup memory footprint is around 60-70 MB.
 
 Then when we start doing our searches, the memory goes up, depending on the fields 
 we search on. We are noticing that if we start searching on new fields, the memory 
 kind of goes up. 
 
 Doug, 
 
 Your calculation below on what is taken up by the searcher, does it take into 
 account the .tii file being read into memory or am I not making any sense ? 
 
 1 byte * Number of searchable fields in your index * Number of docs in 
 your index
 plus
 1k bytes * number of terms in query
 plus
 1k bytes * number of phrase terms in query

You make perfect sense. The formula above does not include the .tii. 
My mistake: I forgot that. By default, every 128th Term in the index is 
read into memory, to permit random access to terms. These are stored in 
the .tii file, compressed. So it is not surprising that they require 7x 
the size of the .tii file in memory.
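Doug's formula plus the 7x .tii observation can be folded into a rough estimator (a sketch only; the constants come from this thread, and real memory use varies):

```java
public class SearcherMemoryEstimate {

    // Rough memory estimate in bytes, per this thread:
    //   1 byte * searchable fields * docs   (field norms)
    //   1 KB   * terms in the query
    //   1 KB   * phrase terms in the query
    //   ~7x the .tii file size              (in-memory term index;
    //                                        every 128th term is loaded)
    public static long estimate(long docs, int fields,
                                int queryTerms, int phraseTerms,
                                long tiiBytes) {
        return docs * fields
             + 1024L * queryTerms
             + 1024L * phraseTerms
             + 7L * tiiBytes;
    }

    public static void main(String[] args) {
        // John Z's setup: 1M docs, 25 fields, 10 MB .tii, no query yet.
        long bytes = estimate(1000000L, 25, 0, 0, 10L * 1024 * 1024);
        System.out.println(bytes + " bytes");
    }
}
```

For the 1M-doc, 25-field index above this predicts roughly 25 MB of norms plus about 70 MB for the term index, in the same ballpark as the reported 60-70 MB startup footprint.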

Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Too many boolean clauses

2004-09-20 Thread Shawn Konopinsky
Hello There,

Due to the fact that the [# TO #] range search works lexicographically, I am
forced to build a rather large boolean query to get range data from my
index.

I have an ID field that contains about 500,000 unique ids. If I want to
query all records with ids [1-2000],  I build a boolean query containing all
the numbers in the range, e.g. id:(1 2 3 ... 1999 2000)

The problem with this is that I get the following error :
org.apache.lucene.queryParser.ParseException: Too many boolean clauses

Any ideas on how I might circumvent this issue by either finding a way to
rewrite the query, or avoid the error?

Thanks in advance,
Shawn.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Too many boolean clauses

2004-09-20 Thread Paul Elschot
On Monday 20 September 2004 18:27, Shawn Konopinsky wrote:
 Hello There,

 Due to the fact that the [# TO #] range search works lexicographically, I am
 forced to build a rather large boolean query to get range data from my
 index.

 I have an ID field that contains about 500,000 unique ids. If I want to
 query all records with ids [1-2000],  I build a boolean query containing
 all the numbers in the range. eg. id:(1 2 3 ... 1999 2000)

 The problem with this is that I get the following error :
 org.apache.lucene.queryParser.ParseException: Too many boolean clauses

 Any ideas on how I might circumvent this issue by either finding a way to
 rewrite the query, or avoid the error?

You can use this as an example:

http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/java/org/apache/lucene/search/DateFilter.java

(Just click view on the latest version to see the code).

and iterate over your doc ids instead of over dates.
This will give you a filter for the doc ids you want to query.
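As an aside (not from this thread's code): the filter avoids the clause limit entirely, but the lexicographic problem itself can also be removed by indexing ids zero-padded to a fixed width, so that string order equals numeric order and a plain range query such as id:[0000001 TO 0002000] works. A sketch of the padding helper (the width of 7 is an assumption sized to the ~500,000 ids mentioned above):

```java
public class IdPadding {

    // Left-pad a non-negative id with zeros to a fixed width so that
    // lexicographic comparison of the padded strings matches numeric order.
    public static String pad(int id, int width) {
        StringBuffer sb = new StringBuffer(Integer.toString(id));
        while (sb.length() < width) {
            sb.insert(0, '0');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Unpadded, "1000" sorts before "2"; padded, numeric order is kept.
        System.out.println("1000".compareTo("2") < 0);             // true
        System.out.println(pad(1000, 7).compareTo(pad(2, 7)) > 0); // true
    }
}
```

Both the documents' id field and the query endpoints must use the same width for this to hold.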

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Similarity scores: tf(), lengthNorm(), sumOfSquaredWeights().

2004-09-20 Thread Paul Elschot

After last week's discussion on idf() of the similarity score computation
I looked into the score computation a bit deeper.

In the DefaultSimilarity tf() is the sqrt() and lengthNorm() is the inverse
of sqrt(). That means that the factor (docTf * docNorm) actually
implements the square root of the density of the query term in the
document field (ignoring the encoding and decoding of the norm).

Summing these weighted square roots resembles a 
Salton OR p-Norm for p = 1/2, except that Salton defined
the p-Norms for p >= 1, and the result is more like an AND
p-Norm because it depends mostly on the minimum argument.

The p-Norm also requires that the sum be taken to the power 1/p,
but this is not necessary here as it would not change the ranking.

I looked around for p-Norms with 0 &lt; p &lt; 1, but I didn't find
anything. Is there really nothing about this? A good discussion is here:
http://elvis.slis.indiana.edu/irpub/SIGIR/1994/cite19.htm

I would guess that since the sqrt() has an infinite derivative at zero, it
might well be that this OR p-Norm for p = 1/2 behaves much like a
rather high power AND p-Norm.

The basic summing form of the OR p-Norm also allows a very easy
implementation by just summing the weighted square roots; an AND
p-Norm for p > 1 would have needed some more calculations.
Is this perhaps one of the reasons for using a power p &lt; 1?

Taking this a bit further, I also wonder about the name of
sumOfSquaredWeights() in the Weight interface.
Shouldn't that rather be  sumOfPowerWeights() and 
by default implement a sum of square roots?
This would allow a more straightforward comprehension
of the term weights as directly weighing the term densities.

Section 5 of the reference above has the full weighted
p-Norm formula's. The OR p-Norm there is very close
to the Lucene formula without coord().

Regards,
Paul Elschot

On Tuesday 14 September 2004 23:49, Doug Cutting wrote:
 Your analysis sounds correct.

 At base, a weight is a normalized tf*idf.  So a document weight is:

docTf * idf * docNorm

 and a query weight is:

queryTf * idf * queryNorm

 where queryTf is always one.

 So the product of these is (docTf * idf * docNorm) * (idf * queryNorm),
 which indeed contains idf twice.  I think the best documentation fix
 would be to add another idf(t) clause at the end of the formula, next to
 queryNorm(q), so this is clear.  Does that sound right to you?

 Doug

 Ken McCracken wrote:
  Hi,
 
  I was looking through the score computation when running search, and
  think there may be a discrepancy between what is _documented_ in the
  org.apache.lucene.search.Similarity class overview Javadocs, and what
  actually occurs in the code.
 
  I believe the problem is only with the documentation.
 
  I'm pretty sure that there should be an idf^2 in the sum.  Look at
  org.apache.lucene.search.TermQuery, the inner class TermWeight.  You
  can see that first sumOfSquaredWeights() is called, followed by
  normalize(), during search.  Further, the resulting value stored in
  the field value is set as the weightValue on the TermScorer.
 
  If we look at what happens to TermWeight, sumOfSquaredWeights() sets
  queryWeight to idf * boost.  During normalize(), queryWeight is
  multiplied by the query norm, and value is set to queryWeight * idf
  == idf * boost * query norm * idf == idf^2 * boost * query norm.  This
  becomes the weightValue in the TermScorer that is then used to
  multiply with the appropriate tf, etc., values.
 
  The remaining terms in the Similarity description are properly
  appended.  I also see that the queryNorm effectively cancels out
  (dimensionally, since it is 1 / the square root of a sum of squares of
  idfs) one of the idfs, so the formula still ends up being roughly a
  TF-IDF formula.  But the idf^2 should still be there, along with the
  expansion of queryNorm.
 
  Am I mistaken, or is the documentation off?
 
  Thanks for your help,
  -Ken
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]

 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Too many boolean clauses

2004-09-20 Thread Shawn Konopinsky
Hey Paul,

Thanks for the quick reply. Excuse my ignorance, but what do I do with the
generated BitSet?

Also - we are using a pooling feature which contains a pool of
IndexSearchers that are used and tossed back each time we need to search.
I'd hate to have to work around this and open up an IndexReader for this
particular search, where all other searches use the pool. Suggestions?

Thanks,
Shawn.




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: indexes won't close on windows - solved

2004-09-20 Thread Fred Toth
All,
Many thanks for your help and comments. I found a bug in
my code where, in obscure circumstances, the indexes were
being left open. Now fixed, thanks to everyone's help.
Fred
Re: Too many boolean clauses

2004-09-20 Thread Paul Elschot
On Monday 20 September 2004 20:54, Shawn Konopinsky wrote:
 Hey Paul,

 Thanks for the quick reply. Excuse my ignorance, but what do I do with the
 generated BitSet?

You can return it in the bits() method of the object implementing your
org.apache.lucene.search.Filter (http://jakarta.apache.org/lucene/docs/api/index.html).
Then pass the Filter to IndexSearcher.search() along with the query.

Regards,
Paul


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Too many boolean clauses

2004-09-20 Thread Paul Elschot
On Monday 20 September 2004 20:54, Shawn Konopinsky wrote:
 Hey Paul,

...

 Also - we are using a pooling feature which contains a pool of
 IndexSearchers that are used and tossed back each time we need to search.
 I'd hate to have to work around this and open up an IndexReader for this
 particular search, where all other searches use the pool. Suggestions?

You could use a map from each IndexSearcher back to the IndexReader that was
used to create it. (It's a bit of a waste, because the IndexSearcher has a reader
attribute internally.)

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Highlighting PDF file after the search

2004-09-20 Thread Balasubramanian . Vijay




Hello,

I can successfully index and search the PDF documents; however, I am not
able to highlight the searched text in my original PDF file (i.e. like dtSearch
highlights on the original file).

I took a look at the highlighter in the sandbox, compiled it and have it
ready. I am wondering if this highlighter is for highlighting indexed
documents, or can it be used for PDF files as is? Please enlighten!

Thanks,

Vijay Balasubramanian
DPRA Inc.,


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Highlighting PDF file after the search

2004-09-20 Thread David Spencer
I did this a few weeks ago.
There are two ways, and they both revolve around the same thing: you need 
the tokenized PDF text available.

[a] Store the tokenized PDF text in the index, or in some other file on 
disk, i.e. a cache (but cache is a misleading term, as you can't have 
a cache miss unless you can do [b]).

[b] Tokenize it on the fly when you call getBestFragments() - the first 
arg, the TokenStream, should be one that takes a PDF file as input and 
tokenizes it.

http://www.searchmorph.com/pub/jakarta-lucene-sandbox/contributions/highlighter/build/docs/api/org/apache/lucene/search/highlight/Highlighter.html#getBestFragments(org.apache.lucene.analysis.TokenStream,%20java.lang.String,%20int,%20java.lang.String)
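The sandbox Highlighter's getBestFragments() does the fragment selection for you once it has a TokenStream; the underlying idea can be sketched in plain Java (a toy illustration only, not the Highlighter API): slide a window over the extracted PDF text and keep the window with the most query-term hits.

```java
public class BestFragment {

    // Return the fragSize-character window of text containing the most
    // (case-insensitive) occurrences of term - a toy version of what a
    // real fragment scorer does over a TokenStream. Assumes fragSize >= 2.
    public static String best(String text, String term, int fragSize) {
        String lower = text.toLowerCase();
        String t = term.toLowerCase();
        int step = Math.max(1, fragSize / 2);
        int bestStart = 0;
        int bestScore = -1;
        for (int start = 0; start < lower.length(); start += step) {
            int end = Math.min(start + fragSize, lower.length());
            int score = 0;
            for (int i = lower.indexOf(t, start); i >= 0 && i < end;
                 i = lower.indexOf(t, i + 1)) {
                score++;
            }
            if (score > bestScore) {
                bestScore = score;
                bestStart = start;
            }
        }
        return text.substring(bestStart,
                              Math.min(bestStart + fragSize, text.length()));
    }
}
```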
Thanks,
Vijay Balasubramanian
DPRA Inc.,
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Highlighting PDF file after the search

2004-09-20 Thread Balasubramanian . Vijay




Thanks David.  I'll give that a shot and let you know.

Vijay Balasubramanian
DPRA Inc.,
214 665 7503


   

  David Spencer

  dave-lucene-userTo:   Lucene Users List [EMAIL 
PROTECTED]
  @tropo.com  cc: 

   Subject:  Re: Highlighting PDF file 
after the search
  09/20/2004 05:02 

  PM   

  Please respond to

  Lucene Users List

   

   





[EMAIL PROTECTED] wrote:




 Hello,

 I can successfully index and search the PDF documents, however i am
not
 able to highlight the searched text in my original PDF file (ie: like
 dtSearch
 highlights on original file)

 I took a look at the highlighter in sandbox, compiled it and have it
 ready.  I am wondering if this highlighter is for highlighting indexed
 documents or
 can it be used for PDF Files as is !  Please enlighten !

I did this a few weeks ago.

There are two ways, and they both revolve round the same thing, you need

the tokenized PDF text available.

[a] Store the tokenized PDF text in the index, or in some other file on
disk i.e. a cache ( but cache is a misleading term, as you can't have
a cache miss unless you can do [b]).

[b] Tokenize it on the fly when you call getBestFragments() - the 1st
arg, the TokenStream, should be one that takes a PDF file as input and
tokenizes it.

http://www.searchmorph.com/pub/jakarta-lucene-sandbox/contributions/highlighter/build/docs/api/org/apache/lucene/search/highlight/Highlighter.html#getBestFragments(org.apache.lucene.analysis.TokenStream,%20java.lang.String,%20int,%20java.lang.String)
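For illustration, option [b] might look roughly like the sketch below,
using the sandbox highlighter's getBestFragments(TokenStream, ...) overload
linked above. This is a hedged sketch, not tested code: extractTextFromPdf()
is an assumed helper (e.g. backed by a PDF text-extraction library such as
PDFBox), and the field name "contents" is illustrative.

```java
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;

// Sketch: highlight query hits in text extracted from a PDF,
// tokenizing on the fly (option [b] above).
// extractTextFromPdf() is a hypothetical helper, not part of Lucene.
String pdfText = extractTextFromPdf(pdfFile);
Highlighter highlighter = new Highlighter(new QueryScorer(query));
TokenStream tokens = new StandardAnalyzer()
    .tokenStream("contents", new StringReader(pdfText));
String fragments = highlighter.getBestFragments(tokens, pdfText, 3, "...");
```

The point is that the TokenStream passed to getBestFragments() does not have
to come from the index; any analyzer run over the freshly extracted text works.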


 Thanks,

 Vijay Balasubramanian
 DPRA Inc.,













RE: Highlighting PDF file after the search

2004-09-20 Thread Bruce Ritchie
 From: [EMAIL PROTECTED] 

 I can successfully index and search the PDF documents;
 however, I am not able to highlight the searched text in my
 original PDF file (i.e. the way dtSearch highlights in the original file).
 
 I took a look at the highlighter in the sandbox, compiled it, and
 have it ready.  I am wondering if this highlighter is only for
 highlighting indexed documents or whether it can be used for PDF
 files as-is.  Please enlighten!

The highlighter code in the sandbox can facilitate highlighting of text
*extracted* from the PDF; however, it does nothing to highlight
search terms *inside* the PDF itself. For that you will need some sort of tool
that can modify the PDF on the fly as the user views it. I know of no quick
and dirty tool that does this, though there are quite a few
projects and products for manipulating PDF files that could likely
be used to obtain the behavior you are looking for (with some effort on
your part).


Regards,

Bruce Ritchie


smime.p7s
Description: S/MIME cryptographic signature


Problems with Lucene + BDB (Berkeley DB) integration

2004-09-20 Thread Christian Rodriguez
Hi everyone, 

I am trying to use the Lucene + BDB integration from the sandbox
(http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/db/).
I installed C Berkeley DB 4.2.52 and I have the Lucene jar file.

I have an example program that indexes 4 small text files in a
directory (it's very similar to the IndexFiles.java in the Lucene demo,
except that it uses BDB + Lucene). The problem I have is that
executing the indexing program generates different results each time I
run it. For example: if I start with an empty index, run the indexing
program, and then query the index, I get the correct results; then I
delete the index to start from scratch, perform the same
sequence, and get no results. (?)

What puzzles me is the non-deterministic results... the same execution
sequence generates two different results. I then wrote a program to
dump the index and I found out that the list of files that end up in
the index is different every time I index those 4 files.

For example:
1st run: contents of directory: _4.f2, _4.f3, _4.cfs, _4.fdx, _4.fnm,
_4.frq, _4.prx, _4.tii, segments, deletable. (9 files)
2nd run: contents of directory: 0:_4.f1, _4.cfs, _4.fdt, _4.fdx,
_4.fnm, _4.frq, _4.prx, _4.tii, _4.tis, segments, deletable. (11
files)

Does anyone have any idea why this is happening?
Has anyone been able to use the BDB + Lucene integration with no problems?

I'd appreciate any help or pointers.
Thanks!
Xtian




Re: Problems with Lucene + BDB (Berkeley DB) integration

2004-09-20 Thread Andy Goodell
I used BDB + Lucene successfully with the Lucene 1.3 distribution,
but it broke in my application with the 1.4 distribution.  By default,
1.4 writes indexes in the compound file format (a single .cfs file per
segment), so maybe that change is the source of the issues.

good luck,
andy g
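If the 1.4 default index format is indeed the culprit, one thing worth
trying is switching the writer back to the pre-1.4 multi-file layout. A
hedged sketch (untested against the BDB-backed Directory; the analyzer
choice is illustrative):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

// Sketch: revert to the 1.3-style multi-file index format, which may
// interact differently with a custom (e.g. BDB-backed) Directory.
IndexWriter writer = new IndexWriter(directory, new StandardAnalyzer(), true);
writer.setUseCompoundFile(false); // the 1.4 default is true (.cfs files)
```

If the nondeterminism disappears with the compound format disabled, that
would narrow the problem down to how the BDB Directory handles .cfs files.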


On Mon, 20 Sep 2004 19:36:51 -0300, Christian Rodriguez
[EMAIL PROTECTED] wrote:
 [quoted message snipped]





Re: Use of SortComparator.getComparable() ?

2004-09-20 Thread Tea Yu
Dear all,

I'm currently implementing sort logic that leverages an external index;
however, I'm confused by newComparator() and getComparable() in
SortComparator.

It seems natural to me that IndexSearcher -> FieldSortedHitQueue ->
factory.newComparator().  However, what's the use of getComparable() if
newComparator() is doing the job?  Any usage scenario?

Thanks
Tea
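
One way to read the API (a hedged sketch, not an authoritative answer):
SortComparator supplies a newComparator() implementation for you, and that
implementation calls your getComparable() to turn each term value into
something comparable, so a subclass normally overrides only getComparable().
The class and field semantics below are illustrative:

```java
import org.apache.lucene.search.SortComparator;

// Sketch: a custom SortComparator that sorts documents by the numeric
// value stored in the sort field's term text. Only getComparable() is
// overridden; the inherited newComparator() wires it into the
// FieldSortedHitQueue machinery.
public class NumericSortComparator extends SortComparator {
    protected Comparable getComparable(String termtext) {
        return new Integer(Integer.parseInt(termtext));
    }
}
```

Under that reading, getComparable() is the extension point and
newComparator() is plumbing you would only override for exotic cases.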





WildCardQuery

2004-09-20 Thread Raju, Robinson (Cognizant)

Is there a limitation in Lucene when it comes to wildcard searches?
Is it a problem if we use fewer than 3 characters along with a
wildcard (*)?
I get a TooManyClauses error if I try using 45*, *34, *3, etc.
It doesn't happen if '?' is used instead of '*'.
The intriguing thing is that it is not consistent: 00* doesn't fail.
Am I missing something?

Robin


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
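
A wildcard query is rewritten into a BooleanQuery with one clause per
indexed term matching the pattern, and BooleanQuery caps the clause count
(1024 by default). That would explain the apparent inconsistency: whether
45* or 00* fails depends only on how many distinct terms in the index match,
and '?' matches a single character, so it typically expands to far fewer
terms. A sketch of raising the limit (field name and pattern are
illustrative, and a very high limit can be slow and memory-hungry):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardQuery;

// Sketch: raise the BooleanQuery clause limit before running a
// wildcard query that expands to many terms. The default limit is 1024.
BooleanQuery.setMaxClauseCount(10000);
Query query = new WildcardQuery(new Term("contents", "45*"));
Hits hits = searcher.search(query);
```

Note that setMaxClauseCount() is a global (static) setting, so it affects
all subsequent queries in the JVM, not just this one.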