Re: BooleanQuery$TooManyClauses

2004-03-29 Thread Karsten Konrad

Hi,

the final version throws the exception when query expansion produces more than
1024 clauses (the default maxClauseCount); if you get this in range queries over dates,
try indexing the dates in a rounded format - day granularity, say - rather than one based
on milliseconds. This greatly reduces the number of terms the date range expands to.
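
A minimal sketch of that indexing change against the 1.3-era API (the "date" field
name and the yyyyMMdd pattern are illustrative choices, not from the original mail):

import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Index dates at day granularity, so a range like [20011201 TO 20040201]
// expands to at most one clause per day rather than one per distinct value.
Document doc = new Document();
String day = new SimpleDateFormat("yyyyMMdd").format(new Date()); // e.g. "20040330"
doc.add(Field.Keyword("date", day)); // stored, indexed, not tokenized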

>>
I've noticed the same problem. The strange thing is that it only
happens on some queries. For example the query "blog" results in this
exception but the query for "linux" in my index works just fine.
>>

The query "blog" itself does not cause this exception; rather, some range that you add to it does.

>>
I'm also using a DateRange but I disabled it and still noticed the same behavior.
>>

That would be really strange. Are you absolutely sure?

Regards,

Karsten

-----Original Message-----
From: Kevin A. Burton [mailto:[EMAIL PROTECTED]
Sent: Tuesday, 30 March 2004 09:25
To: Lucene Users List
Subject: Re: BooleanQuery$TooManyClauses


hui wrote:

>Hi,
>I have a range query for dates like [20011201 TO 20040201]; it works
>fine with Lucene 1.3 RC1. After upgrading to 1.3 final, I sometimes get a
>"BooleanQuery$TooManyClauses" exception, regardless of whether the index
>was created with 1.3 RC1 or 1.3 final. From the mail archive this seems
>related to maxClauseCount. Is increasing maxClauseCount the only way
>to avoid this issue in 1.3 final? The dev list has some discussion of
>future plans for this.
I've noticed the same problem. The strange thing is that it only
happens on some queries. For example the query "blog" results in this
exception but the query for "linux" in my index works just fine.

This is the stack trace if anyone's interested:

org.apache.lucene.search.BooleanQuery$TooManyClauses
    at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:109)
    at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:101)
    at org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:99)
    at org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:240)
    at org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:240)
    at org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:188)
    at org.apache.lucene.search.Query.weight(Query.java:120)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:128)
    at org.apache.lucene.search.MultiSearcher.search(MultiSearcher.java:150)
    at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:93)
    at org.apache.lucene.search.Hits.<init>(Hits.java:80)
    at org.apache.lucene.search.Searcher.search(Searcher.java:71)

For the record, I'm also using a DateRange but I disabled it and still noticed the same
behavior.

Kevin






Re: BooleanQuery$TooManyClauses

2004-03-29 Thread Kevin A. Burton
hui wrote:

Hi,
I have a range query for dates like [20011201 TO 20040201]; it works fine
with Lucene 1.3 RC1. After upgrading to 1.3 final, I sometimes get a
"BooleanQuery$TooManyClauses" exception, regardless of whether the index was
created with 1.3 RC1 or 1.3 final. From the mail archive this seems related
to maxClauseCount. Is increasing maxClauseCount the only way to avoid this
issue in 1.3 final? The dev list has some discussion of future plans for this.

I've noticed the same problem. The strange thing is that it only
happens on some queries. For example the query "blog" results in this
exception but the query for "linux" in my index works just fine.

This is the stack trace if anyone's interested:

org.apache.lucene.search.BooleanQuery$TooManyClauses
   at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:109)
   at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:101)
   at org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:99)
   at org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:240)
   at org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:240)
   at org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:188)
   at org.apache.lucene.search.Query.weight(Query.java:120)
   at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:128)
   at org.apache.lucene.search.MultiSearcher.search(MultiSearcher.java:150)
   at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:93)
   at org.apache.lucene.search.Hits.<init>(Hits.java:80)
   at org.apache.lucene.search.Searcher.search(Searcher.java:71)

For the record, I'm also using a DateRange but I disabled it and still noticed the same behavior.

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster





Re: What happened with build.xml in CVS?

2004-03-29 Thread Vladimir Yuryev
Thanks, Erik.
Ant 1.6.1 works with build.xml v.1.58 without problems.
Vladimir.
On Mon, 29 Mar 2004 08:32:56 -0500
 Erik Hatcher <[EMAIL PROTECTED]> wrote:
Cool... my sinister plan of subversively getting the world to upgrade 
to Ant 1.6 is working!  :)

	Erik

On Mar 29, 2004, at 4:34 AM, Rob Oxspring wrote:

Looks like Erik's commits 2 days back have upped the dependency from
Ant 1.5 to 1.6.  Previously only selected tasks were allowed outside
of targets, and tstamp doesn't look like one of them.

Rob

Vladimir Yuryev wrote:
Hi !
I have made latest update from lucene CVS, in which build.xml has 
problems:
Buildfile: /home/vyuryev/workspace/jakarta-lucene/build.xml
BUILD FAILED: 
file:/home/vyuryev/workspace/jakarta-lucene/build.xml:11: Unexpected 
element "tstamp"
Total time: 297 milliseconds
Best Regards,
Vladimir Yuryev






Re: Special Characters

2004-03-29 Thread Gabriela D
Guys,
Help please.

Gabriela D <[EMAIL PROTECTED]> wrote:

Dear All,

I have modified StandardTokenizer.jj to include many special characters in the
token. However, the problem still exists: the index does not include the characters
":", "{", "}", "(", ")" and "%". Can somebody tell me what I should modify so that the
index will include words with these special characters?

Your help is appreciated.

Thanks and Regards,

Harsha
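
One hedged way to do this without editing StandardTokenizer.jj (a hypothetical
class, not from this thread): subclass CharTokenizer and declare the extra
characters as token characters. The same analyzer must then be used at query time,
or the query side will strip what the index kept.

import java.io.Reader;
import org.apache.lucene.analysis.CharTokenizer;

// Keeps letters, digits and the listed special characters inside tokens.
public class SpecialCharTokenizer extends CharTokenizer {
  public SpecialCharTokenizer(Reader in) {
    super(in);
  }

  protected boolean isTokenChar(char c) {
    return Character.isLetterOrDigit(c) || ":{}()%".indexOf(c) >= 0;
  }
}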




Re: Lucene optimization with one large index and numerous small indexes.

2004-03-29 Thread Esmond Pitt
Don't want to start a buffer size war, but these have always seemed too
small to me. I'd recommend upping both InputStream and OutputStream buffer
sizes to at least 4k, as this is the cluster size on most disks these days,
and also a common VM page size. Reading and writing in smaller quantities
than these is definitely suboptimal.

Esmond Pitt

- Original Message - 
From: "Doug Cutting" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Tuesday, March 30, 2004 8:16 AM
Subject: Re: Lucene optimization with one large index and numerous small
indexes.


> Kevin A. Burton wrote:
> >> One way to force larger read-aheads might be to pump up Lucene's input
> >> buffer size.  As an experiment, try increasing InputStream.BUFFER_SIZE
> >> to 1024*1024 or larger.  You'll want to do this just for the merge
> >> process and not for searching and indexing.  That should help you
> >> spend more time doing transfers with less wasted on seeks.  If that
> >> helps, then perhaps we ought to make this settable via system property
> >> or somesuch.
> >>
> > Good suggestion... seems about 10% -> 15% faster in a few strawman
> > benchmarks I ran.
>
> How long is it taking to merge your 5GB index?  Do you have any stats
> about disk utilization during merge (seeks/second, bytes
> transferred/second)?  Did you try buffer sizes even larger than 1MB?
> Are you writing to a different disk, as suggested?
>
> > Note that right now this var is final and not public... so that will
> > probably need to change.
>
> Perhaps.  I'm reluctant to make it too easy to change this.  People tend
> to randomly tweak every available knob and then report bugs, or, if it
> doesn't crash, start recommending that everyone else tweak the knob as
> they do.  There are lots of tradeoffs with buffer size, cases that folks
> might not think of (like that a wildcard query creates a buffer for
> every term that matches), etc.
>
> > Does it make sense to also increase the
> > OutputStream.BUFFER_SIZE?  This would seem to make sense since an
> > optimize is a large number of reads and writes.
>
> It might help a little if you're merging to the same disk as you're
> reading from, but probably not a lot.  If you're merging to a different
> disk then it shouldn't make much difference at all.
>
> Doug
>






Re: Lucene optimization with one large index and numerous small indexes.

2004-03-29 Thread Kevin A. Burton
Doug Cutting wrote:

How long is it taking to merge your 5GB index?  Do you have any stats 
about disk utilization during merge (seeks/second, bytes 
transferred/second)?  Did you try buffer sizes even larger than 1MB? 
Are you writing to a different disk, as suggested?
I'll do some more testing tonight and get back to you.

Note that right now this var is final and not public... so that will 
probably need to change.


Perhaps.  I'm reluctant to make it too easy to change this.  People
tend to randomly tweak every available knob and then report bugs, or, 
if it doesn't crash, start recommending that everyone else tweak the 
knob as they do.  There are lots of tradeoffs with buffer size, cases 
that folks might not think of (like that a wildcard query creates a 
buffer for every term that matches), etc.
Or you can do what I do and recompile ;) 

Does it make sense to also increase the OutputStream.BUFFER_SIZE?  
This would seem to make sense since an optimize is a large number of 
reads and writes.


It might help a little if you're merging to the same disk as you're 
reading from, but probably not a lot.  If you're merging to a 
different disk then it shouldn't make much difference at all.

Right now we are merging to the same disk...  I'll perform some real
benchmarks with this var too.  Long term we're going to migrate to using
two SCSI disks per machine and then doing parallel queries across them
with optimized indexes.

Also with modern disk controllers and filesystems I'm not sure how much 
difference this should make.  Both Reiser and XFS do a lot of internal 
buffering as does our disk controller.  I guess I'll find out...

Kevin






BooleanQuery$TooManyClauses

2004-03-29 Thread hui
Hi,
I have a range query for dates like [20011201 TO 20040201]; it works fine
with Lucene 1.3 RC1. After upgrading to 1.3 final, I sometimes get a
"BooleanQuery$TooManyClauses" exception, regardless of whether the index was
created with 1.3 RC1 or 1.3 final. From the mail archive this seems related
to maxClauseCount. Is increasing maxClauseCount the only way to avoid this
issue in 1.3 final? The dev list has some discussion of future plans for this.

Regards,
Hui
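
For reference, a hedged sketch of the maxClauseCount workaround under 1.3, where
the limit is a public static field (later versions expose
BooleanQuery.setMaxClauseCount(int) instead):

import org.apache.lucene.search.BooleanQuery;

// Raise the clause limit before running the expanded range query.
// The 1.3 default is 1024; pick a value that covers your widest date range.
BooleanQuery.maxClauseCount = 8 * 1024;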






Re: Javadocs lucene 1.4

2004-03-29 Thread Doug Cutting
Lucene 1.4 has not been released.  Until it is released, you need to 
check out the sources from CVS and build them, including javadoc.

Doug

Stephane James Vaucher wrote:
Are the javadocs available on the site?

I'd like to see the javadocs for lucene-1.4 (specifically SpanQuery)
somewhere on the lucene website. I've subscribed to the users mailing
list, but I've never got a feel for the new version. Is there any way
for this to happen, or should I await 1.4-rc1?
cheers,
sv


Javadocs lucene 1.4

2004-03-29 Thread Stephane James Vaucher
Are the javadocs available on the site?

I'd like to see the javadocs for lucene-1.4 (specifically SpanQuery)
somewhere on the lucene website. I've subscribed to the users mailing
list, but I've never got a feel for the new version. Is there any way
for this to happen, or should I await 1.4-rc1?

cheers,
sv





Re: Lucene optimization with one large index and numerous small indexes.

2004-03-29 Thread Doug Cutting
Kevin A. Burton wrote:
One way to force larger read-aheads might be to pump up Lucene's input 
buffer size.  As an experiment, try increasing InputStream.BUFFER_SIZE 
to 1024*1024 or larger.  You'll want to do this just for the merge 
process and not for searching and indexing.  That should help you 
spend more time doing transfers with less wasted on seeks.  If that 
helps, then perhaps we ought to make this settable via system property 
or somesuch.

Good suggestion... seems about 10% -> 15% faster in a few strawman 
benchmarks I ran.  
How long is it taking to merge your 5GB index?  Do you have any stats 
about disk utilization during merge (seeks/second, bytes 
transferred/second)?  Did you try buffer sizes even larger than 1MB? 
Are you writing to a different disk, as suggested?

Note that right now this var is final and not public... so that will 
probably need to change.
Perhaps.  I'm reluctant to make it too easy to change this.  People tend
to randomly tweak every available knob and then report bugs, or, if it 
doesn't crash, start recommending that everyone else tweak the knob as 
they do.  There are lots of tradeoffs with buffer size, cases that folks 
might not think of (like that a wildcard query creates a buffer for 
every term that matches), etc.

Does it make sense to also increase the 
OutputStream.BUFFER_SIZE?  This would seem to make sense since an 
optimize is a large number of reads and writes.
It might help a little if you're merging to the same disk as you're 
reading from, but probably not a lot.  If you're merging to a different 
disk then it shouldn't make much difference at all.

Doug



Re: Is RangeQuery more efficient than DateFilter?

2004-03-29 Thread Stephane James Vaucher
I've added some information contained on this thread on the wiki.

http://wiki.apache.org/jakarta-lucene/DateRangeQueries

If you wish to add more information, go right ahead, but since I added
this info, I believe it's ultimately my responsibility to maintain it.

sv

On Mon, 29 Mar 2004, Kevin A. Burton wrote:

> Erik Hatcher wrote:
>
> >
> > One more point... caching is done by the IndexReader used for the
> > search, so you will need to keep that instance (i.e. the
> > IndexSearcher) around to benefit from the caching.
> >
> Great... Damn... looked at the source of CachingWrapperFilter and it
> makes sense.  Thanks for the pointer.  The results were pretty amazing.
> Here are the results before and after. Times are in millis:
>
> Before caching the Field:
>
> Searching for Jakarta:
> 2238
> 1910
> 1899
> 1901
> 1904
> 1906
>
> After caching the field:
> 2253
> 10
> 6
> 8
> 6
> 6
>
> That's a HUGE difference :)
>
> I'm very happy :)
>
>





Re: Demoting results

2004-03-29 Thread Stephane James Vaucher
Mark,

Thanks for the update; since I contributed the page, I was going to modify
it (I don't want to force work on others).

sv

On Mon, 29 Mar 2004 [EMAIL PROTECTED] wrote:

> Hi Doug,
> Thanks for the post. BoostingQuery looks to be cleaner, faster and more generally
> useful than my implementation :-)
> Unless anyone has a particularly good reason I'll remove the link to my code that
> Stephane put on the Wiki contributions page.
> I definitely find BoostingQuery very useful and would be happy to see it in Lucene
> core, but I'm not sure it's popular enough to warrant adding special support to
> the query parser.
>
> BTW, I've had a thought about your suggestion for making the highlighter use some
> form of RAM index of sentence fragments and then querying it to get the best
> fragments. This is nice in theory but could fail to find anything if the query is
> of these forms:
> a AND b
> "a b"
> When the code that breaks a doc into "sentence docs" splits co-occurring "a" and
> "b" terms into separate docs, this would produce no match. I don't think there's
> an easy way round that, so I'll stick to the current approach of scoring fragments
> simply based on terms found in the query.
>
>
> Cheers
> Mark
>





Re: Is RangeQuery more efficient than DateFilter?

2004-03-29 Thread Kevin A. Burton
Erik Hatcher wrote:

One more point... caching is done by the IndexReader used for the 
search, so you will need to keep that instance (i.e. the 
IndexSearcher) around to benefit from the caching.

Great... Damn... looked at the source of CachingWrapperFilter and it 
makes sense.  Thanks for the pointer.  The results were pretty amazing.  
Here are the results before and after. Times are in millis:

Before caching the field:

Searching for Jakarta:
2238
1910
1899
1901
1904
1906
After caching the field:
2253
10
6
8
6
6
That's a HUGE difference :)

I'm very happy :)







Re: Tracking/Monitoring Search Terms in Lucene

2004-03-29 Thread Kevin A. Burton
Katie Lord wrote:

I am trying to figure out how to track the search terms that visitors are
using on our site on a monthly basis. Do you all have any suggestions?
 

Don't use Lucene for this... just have your form record the search terms.

Kevin
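
A minimal sketch of that suggestion (the form parameter name and the log writer
are hypothetical):

// Record the raw query string as the form submits it, before Lucene sees it;
// a monthly report is then just an offline aggregation of this log.
String terms = request.getParameter("query");                 // hypothetical form field
searchLog.println(System.currentTimeMillis() + "\t" + terms); // e.g. a PrintWriter over a log file
Hits hits = searcher.search(QueryParser.parse(terms, "contents", analyzer));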






Re: Lucene optimization with one large index and numerous small indexes.

2004-03-29 Thread Kevin A. Burton
Doug Cutting wrote:

One way to force larger read-aheads might be to pump up Lucene's input 
buffer size.  As an experiment, try increasing InputStream.BUFFER_SIZE 
to 1024*1024 or larger.  You'll want to do this just for the merge 
process and not for searching and indexing.  That should help you 
spend more time doing transfers with less wasted on seeks.  If that 
helps, then perhaps we ought to make this settable via system property 
or somesuch.

Good suggestion... seems about 10% -> 15% faster in a few strawman 
benchmarks I ran.   

Note that right now this var is final and not public... so that will 
probably need to change.  Does it make sense to also increase the 
OutputStream.BUFFER_SIZE?  This would seem to make sense since an 
optimize is a large number of reads and writes.  

I'm obviously willing to throw memory at the problem.






Re: Demoting results

2004-03-29 Thread markharw00d
>>You could, if you fail to find any fragments that match the entire 
>>query, re-query the fragments with a flattened query containing just an 
>>OR of all of the original query terms.

The other issue with this approach I'm still struggling with is simply the cost of
creating the temporary index. I don't know if you got a chance to look at the
"FastIndex" implementation I put together using TreeMaps. I was getting a 2x speed
improvement over RAM indexes, but it was still 4 times slower than the basic cost
of tokenization used by the current highlighter code.  Costs for processing 50k
worth of docs are as follows:
fast indexing :   1182 ms
RAM indexing :    2413 ms
just tokenizing :  310 ms

Still quite an overhead, and I couldn't see any obvious means of improving on this.




Re: Demoting results

2004-03-29 Thread Doug Cutting
[EMAIL PROTECTED] wrote:
Thanks for the post. BoostingQuery looks to be cleaner, faster and more generally useful than my 
implementation :-)
Great!  Glad to hear it was useful.

BTW, I've had a thought about your suggestion for making the highlighter use some form of RAM index of sentence fragments
and then querying it to get the best fragments. This is nice in theory but could fail to find anything if the query is of these forms:
a AND b
"a b"
When the code that breaks a doc into "sentence docs" splits co-occurring "a" and "b" terms into separate docs,
this would produce no match. I don't think there's an easy way round that, so I'll stick to the current approach of scoring
fragments simply based on terms found in the query.
You could, if you fail to find any fragments that match the entire 
query, re-query the fragments with a flattened query containing just an 
OR of all of the original query terms.

Doug



Re: Demoting results

2004-03-29 Thread markharw00d
Hi Doug,
Thanks for the post. BoostingQuery looks to be cleaner, faster and more generally
useful than my implementation :-)
Unless anyone has a particularly good reason I'll remove the link to my code that
Stephane put on the Wiki contributions page.
I definitely find BoostingQuery very useful and would be happy to see it in Lucene
core, but I'm not sure it's popular enough to warrant adding special support to
the query parser.

BTW, I've had a thought about your suggestion for making the highlighter use some
form of RAM index of sentence fragments and then querying it to get the best
fragments. This is nice in theory but could fail to find anything if the query is
of these forms:
a AND b
"a b"
When the code that breaks a doc into "sentence docs" splits co-occurring "a" and
"b" terms into separate docs, this would produce no match. I don't think there's
an easy way round that, so I'll stick to the current approach of scoring fragments
simply based on terms found in the query.


Cheers
Mark




Re: Overriding coordination

2004-03-29 Thread Doug Cutting
Boris Goldowsky wrote:
I have a situation where I'm querying for something in several fields,
with a clause similar to this:
  (title:(two words)^20  keywords:(two words)^10  body:(two words))
Some good documents are being scored too low if the query terms do not
occur in the "body" field.  I naively thought that would only make a few
% difference, because of the large boosts on the title and keywords
fields, but in fact the document loses 1/3 of its score because of the
coordination term (2/3 rather than 1, because only 2 out of the three
clauses matched).
Now, I love the coordination term for the multiple-word queries
(including the ones embedded in the query above), but for the
conjunction of the different fields I'd like to remove it, and just have
each clause add its score.  I feel like there's a way to do this,
perhaps with a custom Similarity subclass, but I can't quite see how to
set it up.
This is possible in the current CVS, and will be possible in 1.4.

I attached an example to a recent email:

http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]&msgNo=7439

Doug
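
A hedged sketch of the idea against the 1.4 API, paralleling the BoostingQuery
example later in this digest (the three per-field clause variables are
hypothetical): override getSimilarity() on the top-level BooleanQuery so coord()
is a no-op there, while the nested per-field queries keep default coordination.

import org.apache.lucene.search.*;

// Top-level disjunction over fields: every matching clause adds its full
// score, with no coordination penalty when a field fails to match.
BooleanQuery top = new BooleanQuery() {
  public Similarity getSimilarity(Searcher searcher) {
    return new DefaultSimilarity() {
      public float coord(int overlap, int maxOverlap) {
        return 1.0f; // disable coordination at this level only
      }
    };
  }
};
top.add(titleQuery, false, false);    // title:(two words)^20
top.add(keywordsQuery, false, false); // keywords:(two words)^10
top.add(bodyQuery, false, false);     // body:(two words)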



Overriding coordination

2004-03-29 Thread Boris Goldowsky
I have a situation where I'm querying for something in several fields,
with a clause similar to this:
  (title:(two words)^20  keywords:(two words)^10  body:(two words))

Some good documents are being scored too low if the query terms do not
occur in the "body" field.  I naively thought that would only make a few
% difference, because of the large boosts on the title and keywords
fields, but in fact the document loses 1/3 of its score because of the
coordination term (2/3 rather than 1, because only 2 out of the three
clauses matched).

Now, I love the coordination term for the multiple-word queries
(including the ones embedded in the query above), but for the
conjunction of the different fields I'd like to remove it, and just have
each clause add its score.  I feel like there's a way to do this,
perhaps with a custom Similarity subclass, but I can't quite see how to
set it up.

Can anyone point me in the right direction, or perhaps suggest a
different pathway that I'm missing?

Thanks a lot,

Boris






Re: Lucene optimization with one large index and numerous small indexes.

2004-03-29 Thread Doug Cutting
Kevin A. Burton wrote:
We're using Lucene with one large target index which right now is 5G.
Every night we take sub-indexes which are about 500M and merge them
into this main index.  This merge (done via
IndexWriter.addIndexes(Directory[])) is taking way too much time.

Looking at the stats for the box we're essentially blocked on reads.  
The disk is blocked on read IO and CPU is at 5%.  If I'm right I think 
this could be minimized by continually picking the two smaller indexes, 
merging them, then picking the next two smallest, merging them, and then 
keep doing this until we're down to one index.

Does this sound about right?
I don't think this will make things much faster.  You'll do somewhat 
fewer seeks, but you'll have to make log(N) passes over all of the data, 
about three or four in your case.  Merging ten indexes in a single pass 
should be fastest, as all of the data is only processed once, but the 
read-ahead on each file needs to be sufficient so that i/o is not 
dominated by seeks.  Can you use iostat or somesuch to find how many 
seeks/second you're seeing on the device?  Also, what's the average 
transfer rate?  Is it anywhere near the disk's capacity?  Finally, if 
possible, write the merged index to a different drive.  Reading the 
inputs from different drives may help as well.

One way to force larger read-aheads might be to pump up Lucene's input 
buffer size.  As an experiment, try increasing InputStream.BUFFER_SIZE 
to 1024*1024 or larger.  You'll want to do this just for the merge 
process and not for searching and indexing.  That should help you spend 
more time doing transfers with less wasted on seeks.  If that helps, 
then perhaps we ought to make this settable via system property or somesuch.

Cheers,

Doug
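
A minimal sketch of the single-pass merge Doug describes (1.3-era API; the paths,
and writing to a second drive, are illustrative assumptions):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Merge all sub-indexes into the main index in one addIndexes() call, so
// the data is processed once instead of log(N) times.
IndexWriter writer = new IndexWriter("/disk2/main-index",  // ideally a different drive
    new StandardAnalyzer(), false);
writer.addIndexes(new Directory[] {
    FSDirectory.getDirectory("/disk1/sub-index-1", false),
    FSDirectory.getDirectory("/disk1/sub-index-2", false)
    // ... the remaining sub-indexes, all in this one call
});
writer.close();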



Re: Lucene 1.4 - lobby for final release

2004-03-29 Thread Doug Cutting
Charlie Smith wrote:
I'll vote yes - please release a new version with "too many files open" fixed.
There is no "too many files open" bug, except perhaps in your
application.  It is, however, an easy problem to encounter if you don't
close indexes or if you change Lucene's default parameters.  It will be
considerably harder to make happen in 1.4, to keep so many people from
shooting themselves in the foot.

Also, releases are not made by popular election.  They are made by
volunteer developers when deemed appropriate.  If you'd like to get more
involved in Lucene's development, please contribute constructive efforts
to the lucene-dev mailing list.

Maybe default setUseCompoundFile to true on this go-around.
This was discussed at length on the developer mailing list a while back.
 The change has been made and will be present in 1.4.

Otherwise, how can I get 1.3-RC2?  I can't seem to locate it.
The second hit for a Google search on "lucene 1.3RC2" reveals:

  http://www.apachenews.org/archives/000134.html

These search engines sure are amazing, aren't they!

Doug



Re: Demoting results

2004-03-29 Thread Doug Cutting
[EMAIL PROTECTED] wrote:
I have not been able to work out how to get custom coordination going to 
demote results based on a specific term [ ... ]
Yeah, it's a little more complicated than perhaps it should be.

I've attached a class which does this.  I think it's faster and more 
effective than what you proposed.  This only works in the 1.4 codebase 
(current CVS), as it requires the new Query.getSimilarity() method.

To use this, change the line in your test program from:

  Query balancedQuery =
    NegatingQuery.createQuery(positiveQuery, negativeQuery, 1);

to:

  Query balancedQuery =
    new BoostingQuery(positiveQuery, negativeQuery, 0.01f);

Please tell me if you find it useful.

Doug
import java.io.IOException;
import org.apache.lucene.search.*;
import org.apache.lucene.index.*;

/** Boosts the results of a query when a second query also matches.*/
public class BoostingQuery extends Query {
  private float boost;     // the amount to boost by
  private Query match;     // query to match
  private Query context;   // boost when this matches too

  public BoostingQuery(Query match, Query context, float boost) {
    this.match = match;
    this.context = (Query)context.clone();   // clone before changing the boost
    this.boost = boost;

    this.context.setBoost(0.0f);             // ignore context-only matches
  }

  public Query rewrite(IndexReader reader) throws IOException {
    BooleanQuery result = new BooleanQuery() {

      public Similarity getSimilarity(Searcher searcher) {
        return new DefaultSimilarity() {

          public float coord(int overlap, int max) {
            switch (overlap) {

            case 1:            // matched only one clause
              return 1.0f;     // use the score as-is

            case 2:            // matched both clauses
              return boost;    // multiply by the boost

            default:
              return 0.0f;
            }
          }
        };
      }
    };

    result.add(match, true, false);      // the required clause
    result.add(context, false, false);   // the optional context clause

    return result;
  }

  public String toString(String field) {
    return match.toString(field) + "/" + context.toString(field);
  }
}


Re: Patches for RussianAnalyzer

2004-03-29 Thread Erik Hatcher
Vladimir,

I have just taken a look at your submitted patches.  I have no  
objections to making Cp1251 the default charset used in the no-arg  
constructor to RussianAnalyzer, but all of your other changes are  
formatting along with the addition of some other constructors.

Could you please provide a functionality-only diff for your patches,  
preferably in a single file attached to a Bugzilla issue?

Thanks,
Erik
On Mar 17, 2004, at 8:25 AM, Vladimir Yuryev wrote:

Dear developers!

I am a Lucene user writing to you about the RussianAnalyzer. The one
problem when working with this Analyzer is the parameter for the Russian
character encoding (the number of code tables for this one language is,
as you know, always a source of admiration). In Eastern Europe, users of
Russian-language applications mostly use the windows-1251 encoding, since
MS Windows is the basic and most widespread client platform. I suggest
updating the no-argument constructor to default to "Cp1251".

See attached file: RussianAnalyzerPatchs.tgz
RussianAnalyzer.java.path
RussianLetterTokenizer.java.patch
RussianLowerCaseFilter.java.patch
RussianStemFilter.java.patch
TestRussianAnalyzer.java.path

Such an update will remove confusion (for beginners with Lucene or
beginners with Russian) and will make it easier to use the Analyzer when
switching languages in multilanguage search.
Regards,
Vladimir Yuryev.


Tracking/Monitoring Search Terms in Lucene

2004-03-29 Thread Katie Lord
I am trying to figure out how to track the search terms that visitors are
using on our site on a monthly basis. Do you all have any suggestions?

Thanks!

Katie Lord
[EMAIL PROTECTED]   






Re: Is RangeQuery more efficient than DateFilter?

2004-03-29 Thread Erik Hatcher
On Mar 29, 2004, at 8:41 AM, Erik Hatcher wrote:
On Mar 29, 2004, at 4:25 AM, Kevin A. Burton wrote:
I have a 7G index.  A query for a random term comes back fast (300ms) 
when I'm not using a DateFilter but when I add the DateFilter it 
takes 2.6 seconds.  Way too long.  I assume this is because the 
filter API does a post process so it has to read fields off disk.

Is it possible to do this with a RangeQuery?  For example you
could create a "days since January 1, 1970" field and do a range
query between 5 and 10... and then add the original field as
well.
Are you keeping DateFilter around for more than one search?  The 
drawback to pure DateFilter is that it does not cache, so each search 
re-enumerates the terms in the range.  In fact, DateFilter by itself 
is practically of no use, I think.

If you have a set of canned date ranges, there are two approaches 
worth considering:  DateFilter wrapped by a CachingWrapperFilter, or
a RangeQuery wrapped in a QueryFilter (which does cache).

Performance-wise, I don't really think there is much (any?) difference 
in these two approaches, so take your pick.  Once the bit sets are 
cached in a filter, searches will be quite fast.
One more point... caching is done by the IndexReader used for the 
search, so you will need to keep that instance (i.e. the IndexSearcher) 
around to benefit from the caching.

	Erik



Re: Is RangeQuery more efficient than DateFilter?

2004-03-29 Thread Erik Hatcher
On Mar 29, 2004, at 4:25 AM, Kevin A. Burton wrote:
I have a 7G index.  A query for a random term comes back fast (300ms) 
when I'm not using a DateFilter but when I add the DateFilter it takes 
2.6 seconds.  Way too long.  I assume this is because the filter API 
does a post process so it has to read fields off disk.

Is it possible to do this with a RangeQuery?  For example you could
create a "days since January 1, 1970" field and do a range query
between 5 and 10... and then add the original field as well.
Are you keeping DateFilter around for more than one search?  The 
drawback to pure DateFilter is that it does not cache, so each search 
re-enumerates the terms in the range.  In fact, DateFilter by itself is 
practically of no use, I think.

If you have a set of canned date ranges, there are two approaches worth 
considering:  DateFilter wrapped by a CachingWrapperFilter, or a
RangeQuery wrapped in a QueryFilter (which does cache).

Performance-wise, I don't really think there is much (any?) difference 
in these two approaches, so take your pick.  Once the bit sets are 
cached in a filter, searches will be quite fast.

	Erik
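
A hedged sketch of the two cached-filter options against the 1.4 API (the "date"
field, the from/to Date variables and the canned range are illustrative):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

// Option 1: a DateFilter wrapped so its bit set is cached per IndexReader
// (from and to are java.util.Date values for the canned range).
Filter byDate = new CachingWrapperFilter(new DateFilter("date", from, to));

// Option 2: a RangeQuery wrapped in a QueryFilter, which caches on its own.
Filter byRange = new QueryFilter(
    new RangeQuery(new Term("date", "20011201"),
                   new Term("date", "20040201"), true));

// Keep the same IndexSearcher around: the cache lives with its IndexReader.
Hits hits = searcher.search(query, byDate);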



Re: What happened with build.xml in CVS?

2004-03-29 Thread Erik Hatcher
Cool... my sinister plan of subversively getting the world to upgrade 
to Ant 1.6 is working!  :)

	Erik

On Mar 29, 2004, at 4:34 AM, Rob Oxspring wrote:

Looks like Erik's commits 2 days back have upped the dependency from
Ant 1.5 to 1.6.  Previously only selected tasks were allowed outside
of targets, and tstamp doesn't look like one of them.

Rob

Vladimir Yuryev wrote:
Hi !
I have made latest update from lucene CVS, in which build.xml has 
problems:
Buildfile: /home/vyuryev/workspace/jakarta-lucene/build.xml
BUILD FAILED: 
file:/home/vyuryev/workspace/jakarta-lucene/build.xml:11: Unexpected 
element "tstamp"
Total time: 297 milliseconds
Best Regards,
Vladimir Yuryev
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: What happened with build.xml in CVS?

2004-03-29 Thread Vladimir Yuryev
Thanks Rob, works now.
Vladimir
On Mon, 29 Mar 2004 10:34:44 +0100
 Rob Oxspring <[EMAIL PROTECTED]> wrote:
Looks like Erik's commits 2 days back have upped the dependency from
Ant 1.5 to 1.6.  Previously only selected tasks were allowed outside
of targets, and tstamp doesn't look like one of them.

Rob

Vladimir Yuryev wrote:
Hi !

I have made latest update from lucene CVS, in which build.xml has 
problems:

Buildfile: /home/vyuryev/workspace/jakarta-lucene/build.xml
BUILD FAILED: 
file:/home/vyuryev/workspace/jakarta-lucene/build.xml:11: 
Unexpected element "tstamp"
Total time: 297 milliseconds

Best Regards,
Vladimir Yuryev




Special Characters

2004-03-29 Thread Gabriela D

Dear All,

I have modified StandardTokenizer.jj to include many special characters in the
token. However, the problem still exists: the index does not include the characters
":", "{", "}", "(", ")" and "%". Can somebody tell me what I should modify so that the
index will include words with these special characters?

Your help is appreciated.

Thanks and Regards,

Harsha




Re: What happened with build.xml in CVS?

2004-03-29 Thread Rob Oxspring
Looks like Erik's commits 2 days back have upped the dependency from Ant
1.5 to 1.6.  Previously only selected tasks were allowed outside of
targets, and tstamp doesn't look like one of them.

Rob

Vladimir Yuryev wrote:
Hi !

I have made latest update from lucene CVS, in which build.xml has problems:

Buildfile: /home/vyuryev/workspace/jakarta-lucene/build.xml
BUILD FAILED: file:/home/vyuryev/workspace/jakarta-lucene/build.xml:11: 
Unexpected element "tstamp"
Total time: 297 milliseconds

Best Regards,
Vladimir Yuryev


Is RangeQuery more efficient than DateFilter?

2004-03-29 Thread Kevin A. Burton
I have a 7G index.  A query for a random term comes back fast (300ms) 
when I'm not using a DateFilter but when I add the DateFilter it takes 
2.6 seconds.  Way too long.  I assume this is because the filter API 
does a post process so it has to read fields off disk.

Is it possible to do this with a RangeQuery?  For example you could
create a "days since January 1, 1970" field and do a range query
between 5 and 10... and then add the original field as well.

I have to make some app changes so I figured I would ask here before 
moving forward.

Kevin






Special Characters

2004-03-29 Thread Harshavardhan . NM
