Re: Lucene shouldn't use java.io.tmpdir

2004-07-12 Thread Daniel Naber
On Monday 12 July 2004 09:04, Morus Walter wrote:

 Lucene might work around this by creating a directory in java.io.tmpdir,
 setting appropriate permissions (can that be done in Java in an
 OS-independent way?) and putting the lock there.

But if everybody can delete your lock files, that would be a security 
problem. Deleting stale locks isn't a problem, but how would one decide if 
a lock is stale?

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



AW: Understanding TooManyClauses-Exception and Query-RAM-size

2004-07-12 Thread Martin . Stein
Hi Kevin,

thanks for your answer. That could really solve the problem with the
modificationDate or similar fields.

But what if you create queries that ultimately return only a few hits but
contain a RangeQuery that searches for example an ID-Field of some kind,
where you have to cover a wide range of IDs? I think in general, you will
always have fields that contain lots of different terms and searching even a
small range of one of these fields may lead to this Exception. 

The bottom line, in my opinion, is that you have to take care yourself not
to create certain types of queries that could lead to this Exception. Which
queries are affected depends entirely on the index, which means that as the
index grows you have to restrict the ranges of more and more RangeQueries.

One way would be to catch this Exception and gracefully present a message
asking the user to further restrict the query. But this could lead to some
confusion if the user knows that he has entered a very restrictive query
in addition to a RangeQuery that internally leads to this Exception.

What I would really like to see are some best practices or some advice from
users who are working with really large indices: how they handle this
situation, why they don't have to care about it, or maybe why I am
completely missing the point ;-))


Thanks,

Martin


-Original Message-
From: Kevin A. Burton [mailto:[EMAIL PROTECTED]
Sent: Thursday, July 8, 2004 21:11
To: Lucene Users List
Subject: Re: Understanding TooManyClauses-Exception and Query-RAM-size


[EMAIL PROTECTED] wrote:

Hi,

a couple of weeks ago we migrated from Lucene 1.2 to 1.4rc3. Everything went
smoothly, but we are experiencing some problems with the new constant limit

   maxClauseCount=1024

which leads to Exceptions of type

   org.apache.lucene.search.BooleanQuery$TooManyClauses

when certain RangeQueries are executed (in fact, we get this Exception when
we execute certain Wildcard queries, too). Although we are working with a
fairly small index of about 35,000 documents, we encounter this Exception
when we search on the property modificationDate. For example

   modificationDate:[00 TO 0dwc970kw]

  

We talked about this the other day.

http://wiki.apache.org/jakarta-lucene/IndexingDateFields

Find out what precision you need and use that.  If you only need
days, hours or minutes, then index at that precision.  Milliseconds are
just too fine-grained.

We're only using days, and our queries cover at most the last 7 days,
so this works out really well...
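
To illustrate why coarser precision helps: a range query is expanded into one clause per distinct term in the range, so fewer distinct terms means fewer clauses. The following is only a sketch in plain Java with no Lucene calls; the class name DayPrecision and its methods are hypothetical, and the yyyyMMdd term layout in UTC is an assumption made for the example.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Set;
import java.util.TimeZone;
import java.util.TreeSet;

class DayPrecision {
    // Formats a timestamp as a day-precision term such as "20040712" (UTC).
    static String toDayTerm(long millis) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMdd");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(new Date(millis));
    }

    // Counts the distinct day terms produced by a set of timestamps.
    static int distinctDayTerms(long[] timestamps) {
        Set<String> terms = new TreeSet<String>();
        for (long t : timestamps) {
            terms.add(toDayTerm(t));
        }
        return terms.size();
    }

    public static void main(String[] args) {
        long hour = 60L * 60L * 1000L;
        long[] timestamps = new long[7 * 24];
        for (int i = 0; i < timestamps.length; i++) {
            timestamps[i] = i * hour; // one document per hour for seven days
        }
        // Every timestamp is a different millisecond value, so each is its own term.
        System.out.println(timestamps.length + " distinct millisecond terms");
        System.out.println(distinctDayTerms(timestamps) + " distinct day terms");
    }
}
```

With one document per hour over seven days, this prints 168 distinct millisecond terms but only 7 distinct day terms, which stays far below a clause limit of 1024.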

Kevin

-- 

Please reply using PGP.

http://peerfear.org/pubkey.asc

NewsMonster - http://www.newsmonster.org/

Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
   AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
  IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster





Lucene Search has poor cpu utilization on a 4-CPU machine

2004-07-12 Thread Aviran
Hi all,
First let me explain what I found out. I'm running Lucene on a 4-CPU server.
While doing some stress tests I noticed (by taking a full thread dump) that
searching threads are blocked on the method public FieldInfo fieldInfo(int
fieldNumber). This causes significant CPU idle time.

I noticed that the class org.apache.lucene.index.FieldInfos uses the private
class members Vector byNumber and Hashtable byName, both of which are
synchronized objects. By changing the Vector byNumber to ArrayList byNumber
I was able to get a 110% improvement in performance (number of searches per
second).

My question is: do the fields byNumber and byName have to be synchronized,
and what can happen if I change them to ArrayList and HashMap, which
are not synchronized? Can this corrupt the index or the integrity of the
results?

Thanks,
Aviran






RE: Field.java - STORED, NOT_STORED, etc...

2004-07-12 Thread wallen
I have 2 suggestions:

1) use Eclipse, or another IDE that shows the javadoc on mouseover
2) if you are going to create constants, consider using bit flags.  Then
each constant is a power of two, i.e.

STORED = 1
INDEXED = 2
TOKENIZED = 4

Then you can have the constructor look like:

new Field(name, value, STORED + TOKENIZED)

The constructor would break that down bitwise!
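
As a rough sketch of how such flags could be decoded (this is not Lucene's Field API; the class FieldFlags and its helper methods are hypothetical names for illustration):

```java
class FieldFlags {
    // Each flag occupies its own bit, so any combination is unambiguous.
    static final int STORED = 1;    // binary 001
    static final int INDEXED = 2;   // binary 010
    static final int TOKENIZED = 4; // binary 100

    // Tests whether a single flag bit is set in the combined value.
    static boolean isSet(int flags, int flag) {
        return (flags & flag) != 0;
    }

    // Decodes the combined value back into the three boolean options.
    static String describe(int flags) {
        return "stored=" + isSet(flags, STORED)
            + " indexed=" + isSet(flags, INDEXED)
            + " tokenized=" + isSet(flags, TOKENIZED);
    }

    public static void main(String[] args) {
        System.out.println(describe(STORED | TOKENIZED));
    }
}
```

This prints "stored=true indexed=false tokenized=true". The sketch combines flags with | rather than +, because | is idempotent: STORED | STORED is still STORED, whereas STORED + STORED equals 2, which would be read as INDEXED.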

-Original Message-
From: Kevin A. Burton [mailto:[EMAIL PROTECTED]
Sent: Sunday, July 11, 2004 5:05 AM
To: Lucene Users List
Subject: Field.java - STORED, NOT_STORED, etc...


I've been working with the Field class doing index conversions from
an old index format to my new external content store proposal (thus the
email about the 14M convert).

Anyway... I find the whole Field.Keyword, Field.Text thing confusing.
The main problem is that the constructor to Field just takes booleans,
and if you forget the ordering of the booleans it's very confusing.

new Field( name, value, true, false, true );

So looking at that you have NO idea what it's doing without fetching the javadoc.

So I added a few constants to my class:

new Field( name, value, NOT_STORED, INDEXED, NOT_TOKENIZED );

which IMO is a lot easier to maintain.

Why not add these constants to Field.java:

public static final boolean STORED = true;
public static final boolean NOT_STORED = false;

public static final boolean INDEXED = true;
public static final boolean NOT_INDEXED = false;

public static final boolean TOKENIZED = true;
public static final boolean NOT_TOKENIZED = false;

Of course you still have to remember the order but this becomes a lot 
easier to maintain.

Kevin




IndexSearcher usage and caching?

2004-07-12 Thread Joel Shellman
I'm working on a document management system using Lucene to search
through all the documents.

This means that I'll be adding/updating/deleting documents at the same 
time searches are going on.

I thought to create an IndexSearcher and reuse it throughout, but that 
doesn't seem to work. If I do a search, then add a document, and do 
another search with the same IndexSearcher, it won't find the newly 
added document.

I'd rather not have to create a new IndexSearcher for every query... do 
I have to?

Thanks,
-joel shellman


Re: AW: Understanding TooManyClauses-Exception and Query-RAM-size

2004-07-12 Thread Doug Cutting
[EMAIL PROTECTED] wrote:

 What I really would like to see are some best practices or some advice from
 some users who are working with really large indices how they handle this
 situation, or why they don't have to care about it or maybe why I am
 completely missing the point ;-))

Many folks with really large indexes just don't permit things like
wildcard and range searches.  For example, Google supports no wildcards
and has only recently added limited numeric range searching.  Yahoo!
supports neither.

Doug


Anyone use MultiSearcher class

2004-07-12 Thread Don Vaillancourt
Hello,
Has anyone used the MultiSearcher class?
I have noticed that searching two indexes using the MultiSearcher class
takes 8 times longer than searching only one index.  I could understand if
it took 3 to 4 times longer due to sorting the two search results
and so on, but why 8 times longer?

Is there some optimization that can be done to speed up the search?  Or
should I just write my own MultiSearcher?  The problem, though, is that there
is no way for me to create my own Hits object (no methods are available and
the class is final).

Anyone have any clue?
Thanks
Don Vaillancourt
Director of Software Development
WEB IMPACT INC.
416-815-2000 ext. 245
email: [EMAIL PROTECTED]
web: http://www.web-impact.com

This email message is intended only for the addressee(s)
and contains information that may be confidential and/or
copyright.  If you are not the intended recipient please
notify the sender by reply email and immediately delete
this email. Use, disclosure or reproduction of this email
by anyone other than the intended recipient(s) is strictly
prohibited. No representation is made that this email or
any attachments are free of viruses. Virus scanning is
recommended and is the responsibility of the recipient.







Re: Lucene Search has poor cpu utilization on a 4-CPU machine

2004-07-12 Thread Doug Cutting
Aviran wrote:

 First let me explain what I found out. I'm running Lucene on a 4 CPU server.
 While doing some stress tests I've noticed (by doing full thread dump) that
 searching threads are blocked on the method: public FieldInfo fieldInfo(int
 fieldNumber) This causes significant CPU idle time.

What version of Lucene are you running?  Also, can you please send the
stack traces of the blocked threads, or at least a description of them?
I'd be interested to see what context this happens in.  In particular,
which IndexReader and Searcher/Scorer/Weight methods does it happen under?

 I noticed that the class org.apache.lucene.index.FieldInfos uses private
 class members Vector byNumber and Hashtable byName, both of which are
 synchronized objects. By changing the Vector byNumber to ArrayList byNumber
 I was able to get 110% improvement in performance (number of searches per
 second).

That's impressive!  Good job finding a bottleneck!

 My question is: do the fields byNumber and byName have to be synchronized
 and what can happen if I change them to be ArrayList and HashMap which
 are not synchronized? Can this corrupt the index or the integrity of the
 results?

I think that is a safe change.  FieldInfos is only modified by
DocumentWriter and SegmentMerger, and there is no possibility of other
threads accessing those instances.  Please submit a patch to the
developer mailing list.
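
The reasoning above can be illustrated with a small sketch. This is not the FieldInfos source; ReadOnlyFieldTable is a hypothetical class, and it assumes the list is fully populated before the object is shared and is never modified afterwards. Under that assumption, concurrent reads of an unsynchronized list need no lock.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class ReadOnlyFieldTable {
    // Populated once in the constructor and never modified afterwards.
    private final List<String> byNumber;

    ReadOnlyFieldTable(List<String> fieldNames) {
        // Defensive copy wrapped as unmodifiable, so later writes are impossible.
        this.byNumber = Collections.unmodifiableList(new ArrayList<String>(fieldNames));
    }

    // Unsynchronized read: no lock is taken, so reader threads never block each other.
    String fieldName(int fieldNumber) {
        return byNumber.get(fieldNumber);
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> names = new ArrayList<String>();
        names.add("title");
        names.add("body");
        names.add("modificationDate");
        final ReadOnlyFieldTable table = new ReadOnlyFieldTable(names);

        // Four reader threads, mirroring the 4-CPU scenario in the thread.
        Thread[] readers = new Thread[4];
        for (int i = 0; i < readers.length; i++) {
            readers[i] = new Thread(new Runnable() {
                public void run() {
                    for (int n = 0; n < 100000; n++) {
                        table.fieldName(n % 3);
                    }
                }
            });
            readers[i].start();
        }
        for (Thread t : readers) {
            t.join();
        }
        System.out.println(table.fieldName(2));
    }
}
```

If any thread could still add entries while others read, the unsynchronized list would no longer be safe and some form of locking or a concurrent collection would be needed.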

Doug


Re:Anyone use MultiSearcher class

2004-07-12 Thread fp235-5
 
I think there is a ParallelMultiSearcher class that extends MultiSearcher. Have
you tried it?

-- Start of original message ---

From    : Don Vaillancourt [EMAIL PROTECTED]
To      : Lucene Users List [EMAIL PROTECTED]
Cc      :
Date    : Mon, 12 Jul 2004 12:36:29 -0400
Subject : Anyone use MultiSearcher class

Hello,

Has anyone used the Multisearcher class?

I have noticed that searching two indexes using this MultiSearcher class 
takes 8 times longer than searching only one index.  I could understand if 
it took 3 to 4 times longer to search due to sorting the two search results 
and stuff, but why 8 times longer.

Is there some optimization that can be done to hasten the search?  Or 
should I just write my own MultiSearcher.  The problem though is that there 
is no way for me to create my own Hits object (no methods are available and 
the class is final).

Anyone have any clue?

Thanks





Re:Anyone use MultiSearcher class

2004-07-12 Thread Don Vaillancourt
Actually, after I implemented the MultiSearcher, I had totally forgotten
about this class.  Although it isn't clear what it does.  I'm assuming that
it uses threads to search multiple indexes.

I'll have to try it.
Thanks
At 01:10 PM 12/07/2004, you wrote:
 I think there is a ParallelMultiSearcher class that extends MultiSearcher.
 Have you tried it?

Don Vaillancourt


Re: Anyone use MultiSearcher class

2004-07-12 Thread Zilverline info
Hi Don,
Yes, I'm using the MultiSearcher (in Zilverline), and have seen no
serious performance issues with it. The app performs well with multiple
indexes; it responds so quickly (with 100k+ documents) that I haven't
even taken the time to measure the difference from a single-index search.
Michael Franken

Don Vaillancourt wrote:
Hello,
Has anyone used the Multisearcher class?
I have noticed that searching two indexes using this MultiSearcher 
class takes 8 times longer than searching only one index.  I could 
understand if it took 3 to 4 times longer to search due to sorting the 
two search results and stuff, but why 8 times longer.

Is there some optimization that can be done to hasten the search?  Or 
should I just write my own MultiSearcher.  The problem though is that 
there is no way for me to create my own Hits object (no methods are 
available and the class is final).

Anyone have any clue?
Thanks


Re: Lucene Search has poor cpu utilization on a 4-CPU machine

2004-07-12 Thread Doug Cutting
Aviran wrote:

 I use Lucene 1.4 final
 Here is the thread dump for one blocked thread (If you want a full thread
 dump for all threads I can do that too)

Thanks.  I think I get the point.  I recently removed a synchronization
point higher in the stack, so that now this one shows up!

Whether or not you submit a patch, please file a bug report in Bugzilla 
with your proposed change, so that we don't lose track of this issue.

Thanks,
Doug


Re: Browse by Letter within a Category

2004-07-12 Thread Daniel Naber
On Monday 12 July 2004 17:48, O'Hare, Thomas wrote:

 Does Lucene have a beginning of line query syntax, like the regular
 expression ^ symbol? For example,
 
 title:^A*

If your title isn't tokenized the ^ is implicit, I think. As usual, if 
your title is tokenized you can easily add another field with the same 
value as title, but in untokenized form.

 What is the best way to sort by a date? I currently have a date field
 that is used for searching in the format MMDD as a Field.Keyword. 

Lucene 1.4 added an IndexSearcher.search() method that takes a Sort
object, which lets you sort by any field. Your date field can be used for
that, as it has the correct format (because sorting it alphabetically will
give you the right order already).
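
A quick way to see why alphabetical order can double as date order is to sort a few keys as plain strings. The sketch below uses no Lucene calls; DateKeySort is a hypothetical name, and it assumes fixed-width keys with the most significant part first (a yyyyMMdd layout is used purely for illustration).

```java
import java.util.Arrays;

class DateKeySort {
    // Returns a lexicographically sorted copy; the input array is left untouched.
    static String[] sortedCopy(String[] keys) {
        String[] copy = keys.clone();
        Arrays.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        String[] keys = { "20040712", "20031225", "20040101" };
        System.out.println(Arrays.toString(sortedCopy(keys)));
    }
}
```

This prints [20031225, 20040101, 20040712], which is also the chronological order. The property holds only when every key has the same width and the year comes before the month and the day.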

Regards
 Daniel




Exact match search

2004-07-12 Thread yahootintin . 1247688
Hi,

I want to match documents that exactly equal a certain value, not just
contain it.

If I search for foo in Lucene I get back documents like these:

foo
foo bar
bar foo

Is there a way to get just the ones that exactly equal the value I'm
searching for?  In this case, I want to return only the first document
(i.e. foo).

I have a workaround where I store all the values and then, after I get the
hits, go through them and skip those that don't match.  But this returns
result sets of hundreds of documents that I don't need.

Help!




Re: Exact match search

2004-07-12 Thread Daniel Naber
On Monday 12 July 2004 21:17, [EMAIL PROTECTED] wrote:

 I want to match documents that exactly equal a certain value, not just
 contain it.

Just don't tokenize your Fields, and make sure that the query also doesn't 
get tokenized (the easiest way to ensure that is probably to not use 
QueryParser but just build a TermQuery directly from the user's input).
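
The difference between the two approaches can be modelled in plain Java. This is only an illustration and does not call Lucene: ExactMatchSketch and its methods are hypothetical, and whitespace splitting stands in for whatever analyzer would be used in practice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ExactMatchSketch {
    // Models a tokenized field: a value matches if any of its tokens equals the term.
    static List<String> tokenizedMatches(List<String> values, String term) {
        List<String> hits = new ArrayList<String>();
        for (String value : values) {
            if (Arrays.asList(value.split("\\s+")).contains(term)) {
                hits.add(value);
            }
        }
        return hits;
    }

    // Models an untokenized field: the whole value is a single term.
    static List<String> untokenizedMatches(List<String> values, String term) {
        List<String> hits = new ArrayList<String>();
        for (String value : values) {
            if (value.equals(term)) {
                hits.add(value);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("foo", "foo bar", "bar foo");
        System.out.println(tokenizedMatches(values, "foo"));
        System.out.println(untokenizedMatches(values, "foo"));
    }
}
```

The first line printed is [foo, foo bar, bar foo] and the second is [foo], which matches the behaviour described in the question and the fix suggested here.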

Regards
 Daniel




Re: Could search results give an idea of which field matched

2004-07-12 Thread Grant Ingersoll
See the explain functionality in the Javadocs and previous threads.  You can ask
Lucene to explain why it got the results it did for a given hit.

 [EMAIL PROTECTED] 07/12/04 04:52PM 
I search the index on multiple fields. Could the search results also
tell me which field matched so that the document was selected? From what
I can tell, only the document number and a score are returned. Is there
a way to also find out which field(s) of the document matched the
query?

 

Sildy

 






Re: Field.java - STORED, NOT_STORED, etc...

2004-07-12 Thread Kevin A. Burton
Doug Cutting wrote:

 It would be best to get the compiler to check the order.

 If we change this, why not use type-safe enumerations:

 http://www.javapractices.com/Topic1.cjp

 The calls would look like:

 new Field(name, value, Stored.YES, Indexed.NO, Tokenized.YES);

 Stored could be implemented as the nested class:

 public final class Stored {
   private Stored() {}
   public static final Stored YES = new Stored();
   public static final Stored NO = new Stored();
 }

+1... I'm not in love with this pattern, but since Java 1.4 doesn't
support enums it's better than nothing.
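
For reference, a self-contained sketch of the pattern quoted above. SketchField is a hypothetical stand-in and not Lucene's Field; Indexed and Tokenized are assumed to follow the same shape as Stored. The nested classes are declared static here, which older Java versions require for a nested class that declares static fields like these.

```java
class TypeSafeFlags {
    static final class Stored {
        static final Stored YES = new Stored();
        static final Stored NO = new Stored();
        private Stored() {}
    }

    static final class Indexed {
        static final Indexed YES = new Indexed();
        static final Indexed NO = new Indexed();
        private Indexed() {}
    }

    static final class Tokenized {
        static final Tokenized YES = new Tokenized();
        static final Tokenized NO = new Tokenized();
        private Tokenized() {}
    }

    // Hypothetical stand-in for a field whose options are distinct types.
    static final class SketchField {
        final String name;
        final String value;
        final boolean stored;
        final boolean indexed;
        final boolean tokenized;

        SketchField(String name, String value,
                    Stored stored, Indexed indexed, Tokenized tokenized) {
            this.name = name;
            this.value = value;
            // Only two instances of each type exist, so identity comparison is enough.
            this.stored = (stored == Stored.YES);
            this.indexed = (indexed == Indexed.YES);
            this.tokenized = (tokenized == Tokenized.YES);
        }
    }

    public static void main(String[] args) {
        SketchField f = new SketchField("title", "Lucene",
                Stored.YES, Indexed.NO, Tokenized.YES);
        System.out.println(f.stored + " " + f.indexed + " " + f.tokenized);
        // Passing the options in the wrong order, e.g. (Indexed.NO, Stored.YES, ...),
        // is a compile-time type error rather than a silent bug.
    }
}
```

This prints "true false true". Unlike the boolean constants, each option has its own type, so the compiler checks the argument order.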

I also didn't want to submit a recommendation that would break APIs. I 
assume the old API would be deprecated?

Kevin


Re: Why is Field.java final?

2004-07-12 Thread Kevin A. Burton
Doug Cutting wrote:

 Kevin A. Burton wrote:

  I was going to create a new IDField class which just calls super(
  name, value, false, true, false) but noticed I was prevented because
  Field.java is final?

 You don't need to subclass to do this, just a static method somewhere.

  Why is this? I can't see any harm in making it non-final...

 Field and Document are not designed to be extensible. They are
 persisted in such a way that added methods are not available when the
 field is restored. In other words, when a field is read, it always
 constructs an instance of Field, not a subclass.

That's fine... I think that's acceptable behavior. I don't think anyone
would assume that inner vars are restored or that the field is serialized.

Not a big deal but it would be nice...
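
The static-method alternative mentioned in the quoted reply could look roughly like this. SimpleField is a hypothetical stand-in for a final field class taking three booleans in the order (store, index, token), matching the super(...) call quoted above; FieldFactorySketch and idField are likewise made-up names.

```java
class FieldFactorySketch {
    // Hypothetical stand-in for a final field class that cannot be subclassed.
    static final class SimpleField {
        final String name;
        final String value;
        final boolean stored;
        final boolean indexed;
        final boolean tokenized;

        SimpleField(String name, String value,
                    boolean store, boolean index, boolean token) {
            this.name = name;
            this.value = value;
            this.stored = store;
            this.indexed = index;
            this.tokenized = token;
        }
    }

    // Factory method replacing the intended IDField subclass:
    // not stored, indexed, not tokenized.
    static SimpleField idField(String name, String value) {
        return new SimpleField(name, value, false, true, false);
    }

    public static void main(String[] args) {
        SimpleField id = idField("id", "doc-42");
        System.out.println(id.name + "=" + id.value
                + " stored=" + id.stored
                + " indexed=" + id.indexed
                + " tokenized=" + id.tokenized);
    }
}
```

The boolean ordering is written down once, inside the factory method, and callers only see the descriptive method name.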


Re: Lucene Search has poor cpu utilization on a 4-CPU machine

2004-07-12 Thread Kevin A. Burton
Aviran wrote:

 Bug 30058 posted
 

Which of course is here:
http://issues.apache.org/bugzilla/show_bug.cgi?id=30058
Is this the source of the revision you modified?
http://www.mail-archive.com/[EMAIL PROTECTED]/msg06116.html
Also what version of Lucene?
Kevin