Re: Query#rewrite Question

2004-11-11 Thread Erik Hatcher
On Nov 10, 2004, at 9:51 PM, Satoshi Hasegawa wrote:
Our program accepts input in the form of Lucene query syntax from the 
user,
but we wish to perform additional tasks such as thesaurus expansion. 
So I
want to manipulate the Query object that results from parsing.
You may want to consider using an Analyzer to expand queries rather 
than manipulating the query object itself.
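For illustration, a minimal sketch of that approach against the Lucene 1.4
API: a TokenFilter that stacks synonyms at the same position as the original
token (position increment 0), so the expansions match wherever the original
word would. The Thesaurus interface is hypothetical, standing in for whatever
lookup you use:

import java.io.IOException;
import java.util.Stack;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Hypothetical thesaurus lookup; not part of Lucene.
interface Thesaurus {
    String[] synonymsOf(String word);
}

public class SynonymFilter extends TokenFilter {
    private final Stack pending = new Stack();
    private final Thesaurus thesaurus;

    public SynonymFilter(TokenStream in, Thesaurus thesaurus) {
        super(in);
        this.thesaurus = thesaurus;
    }

    public Token next() throws IOException {
        if (!pending.empty()) {
            return (Token) pending.pop();  // emit buffered synonyms first
        }
        Token token = input.next();
        if (token == null) {
            return null;
        }
        String[] syns = thesaurus.synonymsOf(token.termText());
        for (int i = 0; syns != null && i < syns.length; i++) {
            Token syn = new Token(syns[i], token.startOffset(), token.endOffset());
            syn.setPositionIncrement(0);  // stack the synonym on the same position
            pending.push(syn);
        }
        return token;
    }
}

Wrapping this around your analyzer's normal stream at query time gives the
expansion without touching the parsed Query at all.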

My question is, is the result of the Query#rewrite method guaranteed 
to be
either a TermQuery, a PhraseQuery, or a BooleanQuery, and if it is a
BooleanQuery, do all the constituent clauses also reduce to one of the 
above
three classes?
No.  For example, look at the SpanQuery family.  These do no explicit 
rewriting and thus are left as themselves.

 If not, what if the original Query object was the one that
was obtained from QueryParser#parse method? Can I assume the above in 
this
restricted case?

I experimented with the current version, and the above seems to be 
positive
in this version; I'm asking if this could change in the future. Thank 
you.
I think we'll see QueryParser, or at least more sophisticated versions 
of it, emerge that support SpanQueries.  In fact, in our book, I 
created a subclass of QueryParser that overrides getFieldQuery and 
returns a SpanNearQuery in order to achieve ordered phrase searching.
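
A sketch of that idea (not the book's actual code) against the Lucene 1.4
API, where getFieldQuery takes (String field, String queryText); the exact
signature has shifted between releases:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class SpanPhraseQueryParser extends QueryParser {
    public SpanPhraseQueryParser(String field, Analyzer analyzer) {
        super(field, analyzer);
    }

    protected Query getFieldQuery(String field, String queryText)
            throws ParseException {
        Query q = super.getFieldQuery(field, queryText);
        if (!(q instanceof PhraseQuery)) {
            return q;  // single terms etc. pass through unchanged
        }
        // turn the phrase into an ordered SpanNearQuery
        Term[] terms = ((PhraseQuery) q).getTerms();
        SpanQuery[] clauses = new SpanQuery[terms.length];
        for (int i = 0; i < terms.length; i++) {
            clauses[i] = new SpanTermQuery(terms[i]);
        }
        return new SpanNearQuery(clauses, ((PhraseQuery) q).getSlop(), true);
    }
}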

Erik


Re: Locking issue

2004-11-11 Thread Erik Hatcher
On Nov 11, 2004, at 1:47 AM, [EMAIL PROTECTED] wrote:
Yes, I tried that too and it worked.  The issue is that our
Operations folks plan to install this on a pretty busy box and I
was hoping that Lucene wouldn't cause issues if it only had a
small slice of the CPU.
I don't think that Lucene is causing the issue.  I'd like to wait and 
see if others have opinions/suggestions on this issue.  Again, what 
your example program is doing is unrealistic - you're hammering the 
filesystem and CPU by having infinite loops that do not sleep.  If a 
minimal sleep works then I don't think you'll have to concern the 
operations folks with a bigger box.
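
For illustration, a minimal sketch of such a loop (the indexing step is a
placeholder):

public class IndexPoller {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            // ... check for new/changed documents and reindex as needed ...
            Thread.sleep(50);  // even a small sleep keeps the loop from
                               // monopolizing the CPU and the filesystem
        }
    }
}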

Erik


Bug in the BooleanQuery optimizer? ..TooManyClauses

2004-11-11 Thread Sanyi
Hi!

First of all, I've read about BooleanQuery$TooManyClauses, so I know that it
has a 1024-clause limit by default, which is good enough for me, but I still
think it works strangely.

Example:
I have an index with about 20Million documents.
Let's say that there are about 3000 variants in the entire document set
matching this word mask: cab*
Let's say that about 500 documents contain the word: spectrum
Now, when I search for cab* AND spectrum, I don't expect it to throw an
exception.
It should first restrict the search to the 500 documents containing the word
spectrum, then it should collect the variants of cab* within these documents,
which turns out to be two or three variants of cab* (cable, cables, maybe
some more), and the search should return, let's say, 10 documents.

Similar example: when I search for cab* AND nonexistingword, it still throws
a TooManyClauses exception instead of saying No results, since there is no
nonexistingword in my document set, so it doesn't even have to start
collecting the variations of cab*.

Is there any patch for this issue?
Thank you for your time!

Sanyi
(I'm using: lucene 1.4.2)

p.s.: Sorry for re-sending this message; I first sent it as an accidental
reply to a wrong thread..
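
One way people sidestep the clause expansion is a Filter that walks the
matching terms itself instead of rewriting to a BooleanQuery. A minimal
sketch against the Lucene 1.4 API; this PrefixFilter is hand-rolled, not a
class that ships with 1.4:

import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.search.Filter;

public class PrefixFilter extends Filter {
    private final Term prefix;

    public PrefixFilter(Term prefix) {
        this.prefix = prefix;
    }

    public BitSet bits(IndexReader reader) throws IOException {
        BitSet bits = new BitSet(reader.maxDoc());
        TermEnum terms = reader.terms(prefix);  // seek to first term >= prefix
        TermDocs termDocs = reader.termDocs();
        try {
            do {
                Term term = terms.term();
                if (term == null
                        || !term.field().equals(prefix.field())
                        || !term.text().startsWith(prefix.text())) {
                    break;  // past the prefix range
                }
                termDocs.seek(term);  // mark every doc containing this variant
                while (termDocs.next()) {
                    bits.set(termDocs.doc());
                }
            } while (terms.next());
        } finally {
            termDocs.close();
            terms.close();
        }
        return bits;
    }
}

The rest of the query (spectrum in the example) can then be combined with it
via IndexSearcher.search(query, filter), and no clauses are ever generated.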








Re[2]: Faster highlighting with TermPositionVectors (update)

2004-11-11 Thread Maxim Patramanskij
Hello Mark.

I'm just wondering about the following piece of code from your latest
TokenSources class:

public static TokenStream getAnyTokenStream(IndexReader reader, int docId,
        String field, Analyzer analyzer) throws IOException
{
    TokenStream ts = null;

    TermFreqVector tfv = (TermFreqVector) reader.getTermFreqVector(docId, field);
    if (tfv != null)
    {
        if (tfv instanceof TermPositionVector)
        {
            //read pre-parsed token position info stored on disk
            TermPositionVector tpv = (TermPositionVector)
                    reader.getTermFreqVector(docId, field);
            ts = getTokenStream(tpv);
        }
    }
    //No token info stored so fall back to analyzing raw content
    if (ts == null)
    {
        ts = getTokenStream(reader, docId, field, analyzer);
    }
    return ts;
}

Aren't you calling getTermFreqVector(docId, field) twice?

 Why not just call:

if (tfv instanceof TermPositionVector)
{
    ts = getTokenStream((TermPositionVector) tfv);
}


Max

Friday, November 5, 2004, 12:25:13 AM, you wrote:

m Having revisited the original TokenSources code it looks like one of the 
m optimisations I put in will fail if fields are stored with 
m non-contiguous position info (ie the analyzer has messed with token 
m position numbers so they overlap or have gaps like ..3,3,7,8,9,..).
m I've now made the TokenSources code safe by default by assuming token 
m position values are not contiguous and should not be used for sorting.
m For those who know what they are doing I have added a parameter to one 
m of the methods to turn the optimisation back on if they can guarantee 
m positions are contiguous.

m New code is at the same place:
m http://www.inperspective.com/lucene/TokenSources.java

m Cheers
m Mark






Re[2]: Faster highlighting with TermPositionVectors (update)

2004-11-11 Thread mark harwood
Thanks, Max.
Another schoolboy error in TokenSources.java :) 
More haste, less speed required on my part.

I have updated my code and will post it to the website
tonight. This change doesn't appear to have made a
noticeable difference in performance, but the code is
cleaner.

Cheers
Mark








Re: Search scalability

2004-11-11 Thread Otis Gospodnetic
If you load it explicitly, then all 800 MB will make it into RAM.
It's easy to try, the API for this is super simple.

Otis
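
For illustration, a minimal sketch of that (the index path is a placeholder):

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.store.RAMDirectory;

public class RamSearch {
    public static void main(String[] args) throws Exception {
        // copies every file of the on-disk index into memory,
        // so expect roughly index-sized RAM usage
        RAMDirectory ramDir = new RAMDirectory("/path/to/index");
        Searcher searcher = new IndexSearcher(ramDir);
        // ... run queries against 'searcher' as before ...
        searcher.close();
    }
}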

--- [EMAIL PROTECTED] wrote:

 Does it take 800MB of RAM to load that index into a
 RAMDirectory?  Or are only some of the files loaded into RAM?
 
 --- Otis Gospodnetic [EMAIL PROTECTED] wrote:
 
  Hello,
  
  100 parallel searches going against a single index on a single disk
  means a lot of disk seeks all happening at once.  One simple way of
  working around this is to load your FSDirectory into RAMDirectory.
  This should be faster (could you report your
  observations/comparisons?).  You can also try using ramfs if you are
  using Linux.
  
  Otis
  
  --- Ravi [EMAIL PROTECTED] wrote:
  
    We have one large index for a document repository of 800,000
    documents. The size of the index is 800MB. When we do searches
    against the index, it takes 300-500ms for a single search. We wanted
    to test the scalability and tried 100 parallel searches against the
    index with the same query, and the average response time was 13
    seconds. We used a simple IndexSearcher. The same searcher object
    was shared by all the searches. I'm sure people have had success in
    configuring Lucene for better scalability. Can somebody share their
    approach?

    Thanks
    Ravi.
   
  
 



RE: Search scalability

2004-11-11 Thread Ravi
Thanks a lot. I'll use RAMDirectory and post my results.  



Re: Academic Question About Indexing

2004-11-11 Thread Luke Shannon
40 Million! Wow. OK, this is the kind of answer I was looking for. The site I
am working on indexes maybe 1000 documents at any given time. I think I am OK
with a single index.

Thanks.

- Original Message - 
From: Will Allen [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, November 10, 2004 7:23 PM
Subject: RE: Academic Question About Indexing


I have an application that I run monthly that indexes 40 million documents
into 6 indexes, then uses a MultiSearcher.  The advantage for me is that I
can have multiple writers indexing 1/6 of that total data, reducing the time
it takes to index by about 5X.
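
For illustration, a minimal sketch of that setup (directory paths are
placeholders):

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Searchable;

public class SearchAllIndexes {
    public static void main(String[] args) throws Exception {
        // one sub-searcher per index directory
        Searchable[] searchers = new Searchable[6];
        for (int i = 0; i < searchers.length; i++) {
            searchers[i] = new IndexSearcher("/indexes/" + i);
        }
        // searched exactly like a single IndexSearcher
        MultiSearcher searcher = new MultiSearcher(searchers);
        // ... run queries against 'searcher' ...
        searcher.close();
    }
}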

-Original Message-
From: Luke Shannon [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 10, 2004 2:39 PM
To: Lucene Users List
Subject: Re: Academic Question About Indexing


Don't worry, regardless of what I learn in this forum I am telling my
company to get me a copy of that bad boy when it comes out (which as far as
I am concerned can't be soon enough). I will pay for grama's myself.

I think I have reviewed the code you are referring to and have something
similar working in my own indexer (using the uid). All is well.

My stupid question for the day is: why would you ever want multiple indexes
running if you can build one smart indexer that does everything as
efficiently as possible? Does the answer to this question move me into
multithreaded indexing territory?

Thanks,

Luke


- Original Message - 
From: Otis Gospodnetic [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, November 10, 2004 2:08 PM
Subject: Re: Academic Question About Indexing


 Uh, I hate to market it, but it's in the book.  But you don't have
 to wait for it, as there already is a Lucene demo that does what you
 described.  I am not sure if the demo always recreates the index or
 whether it deletes and re-adds only the new and modified files, but if
 it's the former, you would only need to modify the demo a little bit to
 check the timestamps of File objects and compare them to those stored
 in the index (if they are being stored - if not, you should add a field
 to hold that data)

 Otis

 --- Luke Shannon [EMAIL PROTECTED] wrote:

  I am working on debugging an existing Lucene implementation.
 
  Before I started, I built a demo to understand Lucene. In my demo I
  indexed the entire content hierarchy all at once, then optimized this
  index and used it for queries. It was time consuming but very simple.
 
  The code I am currently trying to fix indexes the content hierarchy by
  folder, creating a separate index for each one. Thus it ends up with a
  bunch of indexes. I still don't understand how this works (I am
  assuming they get merged somewhere that I haven't tracked down yet),
  but I have noticed it doesn't always index the right folder. This
  results in the users reporting inconsistent behavior in searching
  after they make a change to a document.
  To keep things simple I would like to remove all the logic that
  figures out which folder to index and just do them all (usually less
  than 1000 files) so I end up with one index.
 
  Would indexing time be the only area I would be losing out on, or is
  there something more to the approach of creating multiple indexes and
  merging them?
 
  What is a good approach I can take to indexing a content hierarchy
  composed primarily of pdf, xsl, doc and xml where any of these
  documents can be changed several times a day?
 
  Thanks,
 
  Luke
 
 
 



Re: Academic Question About Indexing

2004-11-11 Thread Gard Arneson Haugen
Could I ask how fast the search goes against this index, both for 
simple words and for more advanced phrase and boolean searches?
And is there something smart you have done to make this go fast, either 
in the infrastructure or in the system itself?

Best regards,
Gard Arneson Haugen
Email : [EMAIL PROTECTED]
Mobile: +47 93 05 01 91 
Fax   : +47 21 95 51 99
Magenta News AS - Møllergata 8, 0179 Oslo


Will Allen wrote:
I have an application that I run monthly that indexes 40 million documents into 
6 indexes, then uses a multisearcher.  The advantage for me is that I can have 
multiple writers indexing 1/6 of that total data reducing the time it takes to 
index by about 5X.


RE: Bug in the BooleanQuery optimizer? ..TooManyClauses

2004-11-11 Thread Will Allen
Any wildcard search will automatically expand your query to the number of
terms it finds in the index that suit the wildcard.

For example:

wild* would become wild OR wilderness OR wildman, etc., for each of the terms
that exist in your index.

It is because of this that you quickly reach the 1024-clause limit.  I
automatically set it to max int with the following line:

BooleanQuery.setMaxClauseCount( Integer.MAX_VALUE );





HTMLParser.getReader returning null

2004-11-11 Thread Luke Shannon
Hello;

Things were working fine. I have been re-organizing my code to drop into QA
when I noticed I was no longer getting search results for my HTML files.
When I checked things out, I confirmed I was still creating the Documents,
but realized no content was being indexed.

 HTMLParser parser = new HTMLParser(f);

 // Add the tag-stripped contents as a Reader-valued Text field so it will
 // get tokenized and indexed.
 doc.add(Field.Text("contents", parser.getReader()));
 System.out.println("The content is " + doc.get("contents"));

The SOP line above outputs null where the contents used to be. Has anyone
seen this before?

Thanks,

Luke




RE: Academic Question About Indexing

2004-11-11 Thread Will Allen
I have a servlet that instantiates a MultiSearcher on 6 indexes:
(du -h)
7.2G./0
7.2G./1
7.2G./2
7.2G./3
7.2G./4
7.2G./5
43G .

I recreate the index from scratch each month based upon a 50-gig zip file
with all of the 40 million documents.  I wanted to keep my indexing time as
low as possible without hurting search performance too much, as each searcher
allocates a certain amount of memory proportional to the number of terms it
has.  A single large index has a lot of overlap in terms, so it needs less
memory than multiple indexes.

Anyway, for indexing, I am able to index ~100 documents per second.  The
total indexing process takes 2.5 days.  I have a powerful machine with 2
hyperthreaded processors (Linux sees 4 processors) and 1GB of RAM.  I also
have pretty fast SCSI disks.

I perform no updates or deletes on my indexes.

The indexing process equally divides the work amongst the indexers.  The
bottleneck of the indexing process is not memory or CPU, but rather the disk
IO of 6 writers.  If I had faster disks, I could create more indexers.
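
A rough sketch of that partitioning scheme (paths and the routing rule are
assumptions, not Will's actual code):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class PartitionIndexer extends Thread {
    static final int NUM_INDEXES = 6;
    private final int partition;

    PartitionIndexer(int partition) {
        this.partition = partition;
    }

    public void run() {
        try {
            // 'true' recreates the index from scratch each run
            IndexWriter writer = new IndexWriter("/indexes/" + partition,
                    new StandardAnalyzer(), true);
            // ... add every document whose id % NUM_INDEXES == partition ...
            writer.optimize();
            writer.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < NUM_INDEXES; i++) {
            new PartitionIndexer(i).start();  // one writer per partition
        }
    }
}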

-Original Message-
From: Sodel Vazquez-Reyes
[mailto:[EMAIL PROTECTED]
Sent: Thursday, November 11, 2004 11:37 AM
To: Lucene Users List
Cc: Will Allen
Subject: Re: Academic Question About Indexing


Will,
could you give more details about your architecture?
- each time, do you update or create new indexes?
- what data is stored in each index?
- etc.

because it is quite interesting, and I would like to test it.

Sodel




RE: Bug in the BooleanQuery optimizer? ..TooManyClauses

2004-11-11 Thread Sanyi
Yes, I understand all of this, but I don't want to set it to MaxInt, since it 
can easily lead to
(even accidental) DoS attacks.

What I'm saying is that there is no reason for the optimizer to expand wild*
to more than 1024 variations when I search for somerareword AND wild*, since
somerareword is only present in, let's say, 100 documents, so wild* should
only expand to the words beginning with wild in those 100 documents; then it
would work fine with the default 1024-clause limit.

But it doesn't, so I can choose between unusable queries and accidental DoS
attacks.




Re: Query#rewrite Question

2004-11-11 Thread Paul Elschot

On Thursday 11 November 2004 03:51, Satoshi Hasegawa wrote:
 Hello,
 
 Our program accepts input in the form of Lucene query syntax from the user, 
 but we wish to perform additional tasks such as thesaurus expansion. So I 
 want to manipulate the Query object that results from parsing.
 
 My question is, is the result of the Query#rewrite method guaranteed to be 
 either a TermQuery, a PhraseQuery, or a BooleanQuery, and if it is a 
 BooleanQuery, do all the constituent clauses also reduce to one of the above 
 three classes? If not, what if the original Query object was the one that 
 was obtained from QueryParser#parse method? Can I assume the above in this 
 restricted case?

 I experimented with the current version, and the above seems to be positive 
 in this version; I'm asking if this could change in the future. Thank you. 
 
In general, a Query should either rewrite to another query, or provide a
Weight. During search, the Weight then provides a Scorer to score the docs.

The only other type of query currently available is SpanQuery, which is
a generalization of PhraseQuery. It does not rewrite and provides a Weight.

However, the current QueryParser does not have support for SpanQuery.
So, as long as the QueryParser does not support more than the current types
of queries, and you only use the QueryParser to obtain queries, all the
constituent clauses will reduce as you indicate above.

SpanQuery could be useful for thesaurus expansion. The generalization
it provides is that it allows nested distance queries. For example, in:
"word1 word2"~2
word2 can be expanded to:
word2 or "word3 word4"~4
leading to a query that is not supported by the current QueryParser:
"word1 (word2 or "word3 word4"~4)"~2

SpanQueries can also enforce an order on the matching subqueries,
but that is difficult to express in the current query syntax.
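
A sketch of that example built directly from span queries ("f" stands in for
the field name):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanOrQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class NestedSpans {
    public static SpanQuery build() {
        // "word3 word4"~4
        SpanQuery w3w4 = new SpanNearQuery(new SpanQuery[] {
                new SpanTermQuery(new Term("f", "word3")),
                new SpanTermQuery(new Term("f", "word4")) }, 4, false);
        // word2 or "word3 word4"~4
        SpanQuery expanded = new SpanOrQuery(new SpanQuery[] {
                new SpanTermQuery(new Term("f", "word2")), w3w4 });
        // word1 within 2 positions of the expansion, order not enforced
        return new SpanNearQuery(new SpanQuery[] {
                new SpanTermQuery(new Term("f", "word1")), expanded }, 2, false);
    }
}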

Regards,
Paul Elschot





Re: Bug in the BooleanQuery optimizer? ..TooManyClauses

2004-11-11 Thread Daniel Naber
On Thursday 11 November 2004 20:57, Sanyi wrote:

 What I'm saying is that there is no reason for the optimizer to expand
 wild* to more than 1024 variations

That's the point: there is no query optimizer in Lucene.

Regards
 Daniel

-- 
http://www.danielnaber.de




weird things in 1.4.2 build

2004-11-11 Thread Hetan Shah
Hi guys,
Thanks for the fantastic mailing list, where all the questions get 
answered.
Guys, I have upgraded my installation from 1.3-final to 1.4.2, and now 
when I try to index the files using IndexHTML, the command just hangs at 
the prompt, or parses some 4-5 files and then simply hangs. Any 
idea what the issue could be?

TIA,
-H


getting error message

2004-11-11 Thread Hetan Shah
Does anyone know what does the following error message mean?
TIA.
-H
root cause
java.lang.NullPointerException
at org.apache.jsp.searchResults_jsp._jspService(searchResults_jsp.java:627)
at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:137)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:853)
at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:210)
at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:295)
at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:241)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:853)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:247)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:193)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:256)
at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
at org.apache.catalina.core.StandardContext.invoke(StandardContext.java:2417)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:180)
at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
at org.apache.catalina.valves.ErrorDispatcherValve.invoke(ErrorDispatcherValve.java:171)
at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:641)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:172)
at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:641)
at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:174)
at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
at org.apache.coyote.tomcat4.CoyoteAdapter.service(CoyoteAdapter.java:193)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:781)
at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:549)
at org.apache.tomcat.util.net.TcpWorkerThread.runIt(PoolTcpEndpoint.java:589)
at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:666)
at java.lang.Thread.run(Thread.java:534)



Lucene : avoiding locking

2004-11-11 Thread Luke Shannon
Hi All;

I have hit a snag in my Lucene integration and don't know what to do.

 My company has a content management product. Each time someone changes the
 directory structure or a file within it, that portion of the site needs to
 be re-indexed so the changes are reflected in future searches (indexing must
 happen during run time).

 I have written an Indexer class with a static Index() method. The idea is to
 call the method every time something changes and the index needs to be
 re-examined. I am hoping the logic put in by Doug Cutting surrounding the
 UID will make indexing efficient enough to be called so frequently.

 This class works great when I tested it on my own little site (I have about
 2000 files). But when I drop the functionality into the QA environment, I
 get a locking error.

 I can't access the stack trace; all I can get at is a log file the
 application writes to. Here is the section my class wrote. It was right in
 the middle of indexing and bang, lock issue.

 I don't know if the problem is in my code or something in the existing
 application.

 Error Message:
 ENTER|SearchEventProcessor.visit(ContentNodeDeleteEvent)
 |INFO|INDEXING INFO: Start Indexing new content.
 |INFO|INDEXING INFO: Index Folder Did Not Exist. Start Creation Of New
Index
 |INFO|INDEXING INFO: Beginnging Incremental update comparisions
 |INFO|INDEXING INFO: Beginnging Incremental update comparisions
 |INFO|INDEXING INFO: Beginnging Incremental update comparisions
 |INFO|INDEXING INFO: Beginnging Incremental update comparisions
 |INFO|INDEXING INFO: Beginnging Incremental update comparisions
 |INFO|INDEXING INFO: Beginnging Incremental update comparisions
 |INFO|INDEXING INFO: Beginnging Incremental update comparisions
 |INFO|INDEXING INFO: Beginnging Incremental update comparisions
 |INFO|INDEXING INFO: Beginnging Incremental update comparisions
 |INFO|INDEXING INFO: Beginnging Incremental update comparisions
 |INFO|INDEXING INFO: Beginnging Incremental update comparisions
 |INFO|INDEXING INFO: Beginnging Incremental update comparisions
 |INFO|INDEXING INFO: Beginnging Incremental update comparisions
 |INFO|INDEXING INFO: Beginnging Incremental update comparisions
 |INFO|INDEXING INFO: Beginnging Incremental update comparisions
 |INFO|INDEXING INFO: Beginnging Incremental update comparisions
 |INFO|INDEXING INFO: Beginnging Incremental update comparisions
 |INFO|INDEXING ERROR: Unable to index new content Lock obtain timed out:
Lock@/usr/tomcat/jakarta-tomcat-5.0.19/temp/lucene-398fbd170a5457d05e2f4d43210f7fe8-write.lock
 |ENTER|UpdateCacheEventProcessor.visit(ContentNodeDeleteEvent)

 Here is my code. You will recognize it as pretty much the IndexHTML class
 from the Lucene demo written by Doug Cutting. I have put in a ton of
 comments in an attempt to understand what is going on.

 Any help would be appreciated.

 Luke

 package com.fbhm.bolt.search;

 /*
  * Created on Nov 11, 2004
  *
  * This class will create a single index file for the Content
  * Management System (CMS). It contains logic to ensure
  * indexing is done intelligently. Based on IndexHTML.java
  * from the demo folder that ships with Lucene
  */

 import java.io.File;
 import java.io.IOException;
 import java.util.Arrays;
 import java.util.Date;

 import org.apache.lucene.analysis.standard.StandardAnalyzer;
 import org.apache.lucene.document.Document;
 import org.apache.lucene.index.IndexReader;
 import org.apache.lucene.index.IndexWriter;
 import org.apache.lucene.index.Term;
 import org.apache.lucene.index.TermEnum;
 import org.pdfbox.searchengine.lucene.LucenePDFDocument;
 import org.apache.lucene.demo.HTMLDocument;

 import com.alaia.common.debug.Trace;
 import com.alaia.common.util.AppProperties;

 /**
  * @author lshannon
  * Description: <br>
  *   This class is used to index a content folder. It contains logic to
  *   ensure that only new documents, or documents that have been modified
  *   since the last indexing run, are indexed. <br>
  *   Based on code written by Doug Cutting in the IndexHTML class found in
  *   the Lucene demo
  */
 public class Indexer {
  //true during deletion pass, this is when the index already exists
  private static boolean deleting = false;

  //object to read existing indexes
  private static IndexReader reader;

  //object to write to the index folder
  private static IndexWriter writer;

  //this will be used to write the index file
  private static TermEnum uidIter;

  /*
   * This static method does all the work, the end result is an up-to-date
 index folder
  */
  public static void Index() {
   //we will assume to start the index has been created
   boolean create = true;
   //set the name of the index file
   String indexFileLocation =
     AppProperties.getPropertyAsString("bolt.search.siteIndex.index.root");
   //set the name of the content folder
   String contentFolderLocation =
     AppProperties.getPropertyAsString("site.root");
   //manage whether the index needs to be created or not
   File index = new File(indexFileLocation);
   

Re: Lucene : avoiding locking

2004-11-11 Thread yahootintin-lucene
I'm working on a similar project...
Make sure that only one call to the index method is occurring at
a time.  Synchronizing that method should do it.
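
For example, a minimal sketch (the body is elided):

public class Indexer {
    // 'synchronized' on a static method locks the Indexer class itself,
    // so only one indexing pass can run at a time in this JVM
    public static synchronized void index() {
        // ... open the IndexWriter, update the index, close the writer ...
    }
}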


Re: Lucene : avoiding locking

2004-11-11 Thread Luke Shannon
I will try that now.
Thank you.

- Original Message - 
From: [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Thursday, November 11, 2004 6:56 PM
Subject: Re: Lucene : avoiding locking



Re: Lucene : avoiding locking

2004-11-11 Thread Luke Shannon
Synchronizing the method didn't seem to help. The lock is being detected
right here in the code:

while (uidIter.term() != null
    && uidIter.term().field() == "uid"
    && uidIter.term().text().compareTo(uid) < 0) {
  //delete stale docs
  if (deleting) {
    reader.delete(uidIter.term());
  }
  uidIter.next();
}

This runs fine on my own site, so I am confused. For now I think I am going
to remove the deleting of stale files etc. and just rebuild the index each
time to see what happens.

- Original Message - 
From: [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Thursday, November 11, 2004 6:56 PM
Subject: Re: Lucene : avoiding locking


 I'm working on a similar project...
 Make sure that only one call to the index method is occuring at
 a time.  Synchronizing that method should do it.

 --- Luke Shannon [EMAIL PROTECTED] wrote:

  Hi All;
 
  I have hit a snag in my Lucene integration and don't know what
  to do.
 
   My company has a content management product. Each time
  someone changes the
   directory structure or a file with in it that portion of the
  site needs to
   be re-indexed so the changes are reflected in future searches
  (indexing
  must
   happen during run time).
 
   I have written a Indexer class with a static Index() method.
  The idea is
  too
   call the method every time something changes and the index
  needs to be
   re-examined. I am hoping the logic put in by Doug Cutting
  surrounding the
   UID will make indexing efficient enough to be called so
  frequently.
 
   This class works great when I tested it on my own little site
  (I have about
   2000 file). But when I drop the functionality into the QA
  environment I get
   a locking error.
 
   I can't access the stack trace, all I can get at is a log
  file the
   application writes too. Here is the section my class wrote.
  It was right in
   the middle of indexing and bang lock issue.
 
   I don't know if the problem is in my code or something in the
  existing
   application.
 
   Error Message:
   ENTER|SearchEventProcessor.visit(ContentNodeDeleteEvent)
   |INFO|INDEXING INFO: Start Indexing new content.
   |INFO|INDEXING INFO: Index Folder Did Not Exist. Start
  Creation Of New
  Index
   |INFO|INDEXING INFO: Beginnging Incremental update
  comparisions
   |INFO|INDEXING INFO: Beginnging Incremental update
  comparisions
   |INFO|INDEXING INFO: Beginnging Incremental update
  comparisions
   |INFO|INDEXING INFO: Beginnging Incremental update
  comparisions
   |INFO|INDEXING INFO: Beginnging Incremental update
  comparisions
   |INFO|INDEXING INFO: Beginnging Incremental update
  comparisions
   |INFO|INDEXING INFO: Beginnging Incremental update
  comparisions
   |INFO|INDEXING INFO: Beginnging Incremental update
  comparisions
   |INFO|INDEXING INFO: Beginnging Incremental update
  comparisions
   |INFO|INDEXING INFO: Beginnging Incremental update
  comparisions
   |INFO|INDEXING INFO: Beginnging Incremental update
  comparisions
   |INFO|INDEXING INFO: Beginnging Incremental update
  comparisions
   |INFO|INDEXING INFO: Beginnging Incremental update
  comparisions
   |INFO|INDEXING INFO: Beginnging Incremental update
  comparisions
   |INFO|INDEXING INFO: Beginnging Incremental update
  comparisions
   |INFO|INDEXING INFO: Beginnging Incremental update
  comparisions
   |INFO|INDEXING INFO: Beginnging Incremental update
  comparisions
   |INFO|INDEXING ERROR: Unable to index new content Lock obtain
  timed out:
 
 

Lock@/usr/tomcat/jakarta-tomcat-5.0.19/temp/lucene-398fbd170a5457d05e2f4d432
   10f7fe8-write.lock
 
  |ENTER|UpdateCacheEventProcessor.visit(ContentNodeDeleteEvent)
 
   Here is my code. You will recognize it pretty much as the IndexHTML
   class from the Lucene demo written by Doug Cutting. I have put a ton
   of comments in an attempt to understand what is going on.
 
   Any help would be appreciated.
 
   Luke
 
   package com.fbhm.bolt.search;
 
   /*
    * Created on Nov 11, 2004
    *
    * This class will create a single index file for the Content
    * Management System (CMS). It contains logic to ensure
    * indexing is done intelligently. Based on IndexHTML.java
    * from the demo folder that ships with Lucene.
    */
 
   import java.io.File;
   import java.io.IOException;
   import java.util.Arrays;
   import java.util.Date;
 
   import org.apache.lucene.analysis.standard.StandardAnalyzer;
   import org.apache.lucene.document.Document;
   import org.apache.lucene.index.IndexReader;
   import org.apache.lucene.index.IndexWriter;
   import org.apache.lucene.index.Term;
   import org.apache.lucene.index.TermEnum;
   import org.pdfbox.searchengine.lucene.LucenePDFDocument;
   import org.apache.lucene.demo.HTMLDocument;
 
   import com.alaia.common.debug.Trace;
   import com.alaia.common.util.AppProperties;
 
   /**
    * @author lshannon
    *
    * Description: This class is used to index a content folder. It
    * contains logic to ensure only new 

Re: Query#rewrite Question

2004-11-11 Thread Satoshi Hasegawa
Thank you, Erik and Paul. I'm not sure what SpanQuery is, but anyway we've 
decided to freeze the version of Lucene we use. 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


lucene file locking question

2004-11-11 Thread John Wang
Hi folks:

  My application builds a super-index around the Lucene index,
i.e. it stores some additional information outside of Lucene.

   I am using my own locking outside of the Lucene index via the
FileLock object in the JDK 1.4 nio package.

   My code does the following:

FileLock lock = null;
try {
    lock = myLockFileChannel.lock();

    // index into Lucene

    // index additional information

} finally {
    try {
        // commit the Lucene index by closing the IndexWriter instance
    } finally {
        if (lock != null) {
            lock.release();
        }
    }
}
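
Here myLockFileChannel comes from a plain lock file, set up along these
lines (a minimal sketch; the lock file name is made up):

import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;

// Open (or create) a dedicated application-level lock file and grab its
// channel; the channel.lock() call above blocks until the lock is free.
RandomAccessFile lockFile = new RandomAccessFile("myapp-index.lock", "rw");
FileChannel myLockFileChannel = lockFile.getChannel();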


Now here is the weird thing: say I terminate the process in the middle
of indexing, and run the program again. I get a "Lock obtain timed out"
exception, but as long as I delete the stale lock file, the index
remains uncorrupted.

However, if I turn the Lucene file lock off, since I have a lock outside
it anyway (by doing:

static {
    System.setProperty("disableLuceneLocks", "true");
}
)

and do the same thing, I instead get an unrecoverably corrupted index.

Does the Lucene lock really guarantee index integrity under this kind of
abuse, or am I just getting lucky?
If so, can someone shed some light on how?

Thanks in advance

-John

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: lucene file locking question

2004-11-11 Thread yahootintin-lucene
Disabling locking is only recommended for read-only indexes that aren't
being modified.  I think there is a comment in the code giving a good
example: an index you read off of a CD-ROM.

--- John Wang [EMAIL PROTECTED] wrote:

 [snip]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Bug in the BooleanQuery optimizer? ..TooManyClauses

2004-11-11 Thread Sanyi
 That's the point: there is no query optimizer in Lucene.

Sorry, I'm not very much into Lucene's internal classes; I'm just telling
you the viewpoint of a user. You know, my users aren't technicians, so
answers like yours won't make them happy. They will only see that I
seemingly at random don't allow them to search (because of the 1024
limit). They won't understand why I'm displaying "Please restrict your
search a bit more..." when they've just searched for dodge AND vip* and
there are only a few documents matching this criteria.

So, is the only way to let them search happily to set the max clause
limit to MaxInt?!
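
If raising the limit is the way to go, the knob is a static setter on
BooleanQuery (a minimal sketch; note the setting is process-wide, and a
huge limit trades memory/CPU on pathological wildcard expansions for
fewer rejected queries):

import org.apache.lucene.search.BooleanQuery;

// Call once at startup, before any wildcard/prefix queries are rewritten.
BooleanQuery.setMaxClauseCount(Integer.MAX_VALUE);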






-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Phrase search for more than 4 words throws exception in QueryParser

2004-11-11 Thread Sanyi
Hi!

How do I perform phrase searches with more than four words?

This works well with 1.4.2:
"aa bb cc dd"
I pass the query as a command line parameter on XP: \"aa bb cc dd\"
QueryParser translates it to: text:aa text:bb text:cc text:dd
Runs, searches, finds proper matches.

This throws an exception in QueryParser:
"aa bb cc dd ee"
I pass the query as a command line parameter on XP: \"aa bb cc dd ee\"
The exception's text is:
: org.apache.lucene.queryParser.ParseException: Lexical error at line 1,
column 13.  Encountered: <EOF> after : "\"aa bb cc dd

It doesn't matter which words I enter; the only thing that matters is the
number of words, which can be four at most.

Regards,
Sanyi





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Phrase search for more than 4 words throws exception in QueryParser

2004-11-11 Thread Morus Walter
Sanyi writes:
 
 [snip]
 
Works for me on linux:
java -cp lucene.jar org.apache.lucene.queryParser.QueryParser 'a b c d e f g h i j k l m n o p q r s t u v w x y z'
a b c d e f g h i j k l m n o p q r s t u v w x y z

Must be an XP command line problem.
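
One way to sidestep shell quoting entirely is to parse the phrase in code
(a minimal sketch against the 1.4 API; the class name and the field
"text" are made up to match the example above):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class PhraseParseCheck {
    public static void main(String[] args) throws Exception {
        // The quotes are escaped in Java source, so no shell ever sees them.
        Query q = QueryParser.parse("\"aa bb cc dd ee\"", "text",
                                    new StandardAnalyzer());
        System.out.println(q.toString("text"));   // expect: "aa bb cc dd ee"
    }
}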

HTH
Morus

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]