Re: Lucene search result no stable

2004-01-21 Thread Morus Walter
Ardor Wei writes: What might be the problem? How to solve it? Any suggestion or idea will be appreciated. The only problem with locking I saw so far is that you have to make sure that the temp dir is the same for all applications. Lucene 1.3 stores its lock in the directory that is defined

Re: setMaxClauseCount ??

2004-01-21 Thread Karl Koch
Hi Doug, thank you for the answer so far. I actually wanted to add a large amount of text from an existing document to find a closely related one. Can you suggest another good way of doing this? A direct match will not occur anyway. How can I make a query as close as possible to a Vector Space Model (VSM) query (each

Vector - LinkedList for performance reasons...

2004-01-21 Thread Kevin A. Burton
I'm looking at a lot of the code in Lucene... I assume Vector is used for legacy reasons. In an upcoming version I think it might make sense to migrate to using a LinkedList... since Vector has to do an array copy when it's exhausted. It's also synchronized which kind of sucks... I'm seeing

Re: Vector - LinkedList for performance reasons...

2004-01-21 Thread Francesco Bellomi
I agree that synchronization in Vector is a waste of time if it isn't required, but I'm not sure if LinkedList is a better (faster) choice than ArrayList. I think only a profiler could tell. Francesco Kevin A. Burton [EMAIL PROTECTED] wrote: I'm looking at a lot of the code in Lucene... I
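
Francesco's point that only measurement can settle the question can be tried with a rough timing harness. This is a minimal sketch, not a proper benchmark: the class name `ListAppendTiming` and the helper `timeAppends` are made up for illustration, it only measures appends (the case Kevin raised), and it is written for a current JDK rather than the JDK 1.4 of this thread. A profiler or a dedicated benchmark tool would give more trustworthy numbers.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Vector;

class ListAppendTiming {
    // Hypothetical helper: appends n integers to the list and
    // returns the elapsed wall-clock time in nanoseconds.
    static long timeAppends(List<Integer> list, int n) {
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            list.add(i);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int n = 1_000_000;

        // Warm-up passes so that JIT compilation does not dominate
        // the first measured run.
        timeAppends(new Vector<>(), n);
        timeAppends(new ArrayList<>(), n);
        timeAppends(new LinkedList<>(), n);

        // Vector: synchronized, array-backed, copies when it grows.
        System.out.println("Vector     ns: " + timeAppends(new Vector<>(), n));
        // ArrayList: unsynchronized, array-backed, copies when it grows.
        System.out.println("ArrayList  ns: " + timeAppends(new ArrayList<>(), n));
        // LinkedList: unsynchronized, allocates one node per element.
        System.out.println("LinkedList ns: " + timeAppends(new LinkedList<>(), n));
    }
}
```

The results depend heavily on the JVM, heap settings, and access pattern, so they say nothing by themselves about how Lucene's own use of Vector behaves.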

Re: setMaxClauseCount ??

2004-01-21 Thread Andrzej Bialecki
Karl Koch wrote: Hi Doug, thank you for the answer so far. I actually wanted to add a large amount of text from an existing document to find a closely related one. Can you suggest another good way of doing this? A direct match will not occur anyway. How can I make a query as close as possible to a Vector Space Model (VSM)

HTML tagged terms boosting...

2004-01-21 Thread Alexey Maksakov
Hello! Is there a way to boost terms in HTML documents that are surrounded by HTML tags such as B, H1, etc.? Can it be done with the existing API, or is reimplementing or implementing a TokenStream with custom Token types needed? Though it seems to me, that even such

Re: HTML tagged terms boosting...

2004-01-21 Thread Erik Hatcher
It definitely cannot be done with custom token types. You're probably aiming for field-specific boosting, so you will need to parse the HTML into separate fields and use a multi-field search approach. I'm sure there are other tricks that could be used for boosting, like inserting the words

AW: HTML tagged terms boosting...

2004-01-21 Thread Alexey Maksakov
Thanks for the answer. Yes, I'm after field-specific boosting, but I'm also looking to create short descriptions of the documents found, based on the query (as most search engines do). I've thought about those solutions, but it seemed to me that they are not straightforward and will cause trouble

Re: Query Term Questions

2004-01-21 Thread Erik Hatcher
On Jan 20, 2004, at 10:22 AM, Terry Steichen wrote: 1) Is there a way to set the query boost factor depending not on the presence of a term, but on the presence of two specific terms? For example, I may want to boost the relevance of a document that contains both iraq and clerics, but not

Re: Query Term Questions

2004-01-21 Thread Terry Steichen
Erik, Thanks for your response. My specific comments (TS==) are inserted below. I should make clear that I'm using fairly complex, embedded queries - not ones that the user is expected to enter. Regards, Terry - Original Message - From: Erik Hatcher [EMAIL PROTECTED] To: Lucene Users

Re: Query Term Questions

2004-01-21 Thread Erik Hatcher
On Jan 21, 2004, at 10:01 AM, Terry Steichen wrote: But doesn't the query itself take this into account? If there are multiple matching terms then the overlap (coord) factor kicks in. TS==Except that I'd like to be able to choose to do this on a query-by-query basis. In other words, it's
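
The overlap (coord) factor mentioned here can be illustrated in isolation. The sketch below assumes the coord factor is the fraction of query terms a document matches (overlap divided by the number of query terms), which is how I understand Lucene's default Similarity to compute it. The class `CoordSketch` and the helper `score` are hypothetical and leave out every other part of real scoring (normalization, boosts, idf), so treat the numbers as an illustration of the idea only.

```java
class CoordSketch {
    // Fraction of the query's terms that the document matched.
    // Assumption: mirrors the default coord definition, overlap / maxOverlap.
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    // Hypothetical simplified score: sum of the per-term scores of the
    // matched terms, scaled by the coord factor.
    static float score(float[] matchedTermScores, int queryTermCount) {
        float sum = 0f;
        for (float s : matchedTermScores) {
            sum += s;
        }
        return sum * coord(matchedTermScores.length, queryTermCount);
    }

    public static void main(String[] args) {
        // Two-term query such as "iraq clerics", each term contributing 1.0.
        float both = score(new float[] {1f, 1f}, 2); // document has both terms
        float one = score(new float[] {1f}, 2);      // document has only one
        System.out.println("both terms: " + both);
        System.out.println("one term:   " + one);
    }
}
```

Under these assumptions a document matching both terms scores more than twice as high as one matching a single term, which is the effect Erik refers to. It applies to every query, which is why it does not give Terry the per-query control he asks for.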

Re: Vector - LinkedList for performance reasons...

2004-01-21 Thread Nicolas Toper
Hi, I'd like to help improve Lucene. How can I help? On Wednesday 21 January 2004 16:38, Doug Cutting wrote: Francesco Bellomi wrote: I agree that synchronization in Vector is a waste of time if it isn't required, It would be interesting to see if such synchronization

Re: Query Term Questions

2004-01-21 Thread Morus Walter
Erik Hatcher writes: TS==I've not been able to get negative boosting to work at all. Maybe there's a problem with my syntax. If, for example, I do a search with green beret^10, it works just fine. But green beret^-2 gives me a ParseException showing a lexical error. Have you

Re: QueryParser and stopwords

2004-01-21 Thread Otis Gospodnetic
Hello Morus, --- Morus Walter [EMAIL PROTECTED] wrote: Hi, I'm currently trying to get rid of query parser problems with stopwords (depending on the query, there are ArrayIndexOutOfBoundsExceptions, e.g. for stop AND nonstop where stop is a stopword and nonstop not). While this isn't

Re: setMaxClauseCount ??

2004-01-21 Thread Karl Koch
Hello Doug, that sounds interesting to me. I am referring to a paper written by NIST about Relevance Feedback, which ran tests with 20 to 200 words. This is why I thought it might be good to be able to use all non-stopwords of a document for that and see what happens. Do you know good papers

Re: setMaxClauseCount ??

2004-01-21 Thread Otis Gospodnetic
Karl: http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]msgId=114748 Status: several people have mentioned they wanted to work on it, but nobody has contributed any patches. The code you see at the above URL is not compatible with Lucene 1.3, but could be brought up to date. Otis --- Karl

RE: setMaxClauseCount ??

2004-01-21 Thread Chong, Herb
there are just about as many ways of doing it as there are papers that talk about automatic relevance feedback. many require domain-specific reference documents that are full of facts and therefore good sources of related words. some people use Wordnet. some of these techniques can add 400-500

Re: setMaxClauseCount ??

2004-01-21 Thread Doug Cutting
Karl Koch wrote: Do you know good papers about strategies of how to select keywords effectively beyond the scope of stopword lists and stemming? Using term frequencies of the document is not really possible since Lucene does not provide access to a document vector, does it? Lucene does let you
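
Independently of whatever Lucene offers here (the reply above is cut off), the basic idea of selecting query keywords by in-document term frequency can be sketched in plain Java. Everything in this example is an assumption made for illustration: the class `KeywordSelectionSketch`, the method `topTerms`, the crude tokenizer, and the tiny stopword list. It does no stemming and ignores collection-wide statistics, which a real relevance-feedback scheme would likely want.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class KeywordSelectionSketch {
    // Returns up to k non-stopword terms of the text, most frequent first;
    // ties are broken alphabetically so the result is deterministic.
    static List<String> topTerms(String text, Set<String> stopwords, int k) {
        Map<String, Integer> counts = new HashMap<>();
        // Crude tokenizer: lowercase, split on anything that is not a-z or 0-9.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty() || stopwords.contains(token)) {
                continue;
            }
            counts.merge(token, 1, Integer::sum);
        }

        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });

        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> stopwords = new HashSet<>(Arrays.asList("the", "and", "of", "a"));
        String doc = "The index stores the index terms and the term index";
        // The selected terms could then be joined into one large OR query.
        System.out.println(topTerms(doc, stopwords, 2));
    }
}
```

Capping the list at k terms also keeps the resulting query below whatever clause limit setMaxClauseCount enforces, instead of raising the limit to fit a whole document.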

Re: Query Term Questions

2004-01-21 Thread Terry Steichen
Morus, Unfortunately, using positive boost factors less than 1 causes the parser to barf the same as do negative boost factors. Regards, Terry - Original Message - From: Morus Walter [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Wednesday, January 21, 2004 10:54 AM

Re: Query Term Questions

2004-01-21 Thread Erik Hatcher
On Jan 21, 2004, at 4:21 PM, Terry Steichen wrote: PS: Is this in the docs? If not, maybe it should be mentioned. Depends on what you consider the docs. I looked at QueryParser.jj to see what it parses. Also, on http://jakarta.apache.org/lucene/docs/queryparsersyntax.html it has an example of

Re: Vector - LinkedList for performance reasons...

2004-01-21 Thread Tatu Saloranta
On Wednesday 21 January 2004 08:38, Doug Cutting wrote: Francesco Bellomi wrote: I agree that synchronization in Vector is a waste of time if it isn't required, It would be interesting to see if such synchronization actually impairs overall performance significantly. This would be fairly

1.3-final: now giving me java.io.FileNotFoundException (Too many open files)

2004-01-21 Thread Matt Quail
I'm getting the following stack trace from lucene-1.3-final running on JDK 1.4.2_03-b02 on Linux: java.io.FileNotFoundException: /home/matt/blah/idx/_123n.tis (Too many open files) at java.io.RandomAccessFile.open(Native Method) at