Re: Lucene search result not stable

2004-01-21 Thread Morus Walter
Ardor Wei writes:
 
 What might be the problem? How to solve it?
 Any suggestion or idea will be appreciated.
 
The only locking problem I've seen so far is that you have to
make sure the temp dir is the same for all applications.
Lucene 1.3 stores its lock in the directory defined by the
system property java.io.tmpdir.
I had one component running under Tomcat and one from the shell,
and they used different temp dirs, which is fatal in this case.

Apart from this, it depends pretty much on your environment.
I'm using Lucene on Linux on local filesystems. Other operating
systems or network filesystems may influence locking.

Morus




Re: setMaxClauseCount ??

2004-01-21 Thread Karl Koch
Hi Doug,

thank you for the answer so far.

I actually wanted to add a large amount of text from an existing document to
find a closely related one. Can you suggest another good way of doing this? A
direct match will not occur anyway. How can I make a query most like the Vector
Space Model (VSM) (each word a dimension value - find documents close to
that)? You know as well as I do that the standard VSM does not have any Boolean logic
inside... how do I need to formulate the query to make it as similar as possible to
a vector, in order to find similar documents in the vector space of the Lucene
index?

Cheers,
Karl

 setMaxClauseCount determines the maximum number of clauses, which is not 
 your problem here.  Your problem is with required clauses.  There may 
 only be a total of 31 required (or prohibited) clauses in a single 
 BooleanQuery.  If you need more, then create more BooleanQueries and 
 combine them with another BooleanQuery.  Perhaps this could be done 
 automatically, but I've never heard anyone encounter this limit before. 
   Do you really mean for 32 different terms to be required?  Do any 
 documents actually match this query?
 
 Doug
 
 Karl Koch wrote:
  Hi group,
  
  I ran into an IndexOutOfBoundsException:
  
  - java.lang.IndexOutOfBoundsException: More than 32 required/prohibited
  clauses in query.
  
  The reason: I have more than 32 BooleanClauses. From the mailing list I got
  the info on how to set the maximum number of clauses higher before a loop:
  
  ...
  myBooleanQuery.setMaxClauseCount(Integer.MAX_VALUE);
  while (true) {
    Token token = tokenStream.next();
    if (token == null) {
      break;
    }
    myBooleanQuery.add(new TermQuery(new Term("bla", token.termText())),
                       true, false);
  }
  ...
  
  However the error still remains, why?
  
  Karl
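
For reference, a rough sketch of the nesting Doug suggests. The "contents"
field name is made up, and this assumes Lucene 1.3's add(query, required,
prohibited) signature:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class ChunkedBooleanQuery {
  // Spread required terms over nested BooleanQueries so that no single
  // BooleanQuery holds more than 31 required clauses. This handles up to
  // 31*31 terms; nest one level deeper for more.
  public static Query build(String[] terms) {
    BooleanQuery top = new BooleanQuery();
    BooleanQuery chunk = new BooleanQuery();
    int inChunk = 0;
    for (int i = 0; i < terms.length; i++) {
      chunk.add(new TermQuery(new Term("contents", terms[i])), true, false);
      if (++inChunk == 31) {          // stay under the 31-clause limit
        top.add(chunk, true, false);  // each full chunk is itself required
        chunk = new BooleanQuery();
        inChunk = 0;
      }
    }
    if (inChunk > 0) {
      top.add(chunk, true, false);
    }
    return top;
  }
}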
  
 
 
 






Vector -> LinkedList for performance reasons...

2004-01-21 Thread Kevin A. Burton
I'm looking at a lot of the code in Lucene... I assume Vector is used 
for legacy reasons.  In an upcoming version I think it might make sense 
to migrate to using a LinkedList... since Vector has to do an array copy 
when it's exhausted.

It's also synchronized which kind of sucks...

I'm seeing this being used in a lot of tight loops so things might be 
sped up a bit by using Collections ...

Kevin

--

Please reply using PGP:

   http://peerfear.org/pubkey.asc

   NewsMonster - http://www.newsmonster.org/
   Dean in 2004! - http://blog.deanforamerica.com/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 4D20 40A0 C734 307E C7B4  DCAA 0303 3AC5 BD9D 7C4D
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



Re: Vector -> LinkedList for performance reasons...

2004-01-21 Thread Francesco Bellomi
I agree that synchronization in Vector is a waste of time if it isn't
required, but I'm not sure if LinkedList is a better (faster) choice than
ArrayList. I think only a profiler could tell.

Francesco


Kevin A. Burton [EMAIL PROTECTED] wrote:
 I'm looking at a lot of the code in Lucene... I assume Vector is used
 for legacy reasons.  In an upcoming version I think it might make
 sense to migrate to using a LinkedList... since Vector has to do an
 array copy when it's exhausted.

 It's also synchronized which kind of sucks...

 I'm seeing this being used in a lot of tight loops so things might be
 sped up a bit by using Collections ...

 Kevin








-
Francesco Bellomi
Use truth to show illusion,
and illusion to show truth.






Re: setMaxClauseCount ??

2004-01-21 Thread Andrzej Bialecki
Karl Koch wrote:

Hi Doug,

thank you for the answer so far.

I actually wanted to add a large amount of text from an existing document to
find a closely related one. Can you suggest another good way of doing this? A
direct match will not occur anyway. How can I make a query most like the Vector
Space Model (VSM) (each word a dimension value - find documents close to
that)? You know as well as I do that the standard VSM does not have any Boolean logic
inside... how do I need to formulate the query to make it as similar as possible to
a vector, in order to find similar documents in the vector space of the Lucene
index?
You should try to reduce the dimensionality by reducing the number of 
unique features. In this case, you could for example use only keywords 
(or key phrases) instead of the full content of documents.

--
Best regards,
Andrzej Bialecki
-
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-
FreeBSD developer (http://www.freebsd.org)


HTML tagged terms boosting...

2004-01-21 Thread Alexey Maksakov
Hello!

Is there any idea how to boost terms in HTML documents that are surrounded
by HTML tags such as <B>, <H1>, etc.?

Can it be done with the existing API, or is a reimplementation of
TokenStream with custom Token types needed?

Though it seems to me that even such a re-implementation won't help without
changing the indexing and searcher code... I hope that I'm wrong.

Thanks in advance.

Alexey.







Re: HTML tagged terms boosting...

2004-01-21 Thread Erik Hatcher
It definitely cannot be done with custom token types.  You're probably 
aiming for field-specific boosting, so you will need to parse the HTML 
into separate fields and use a multi-field search approach.

I'm sure there are other tricks that could be used for boosting, like 
inserting the words inside <b> multiple times into the same field, for 
example.
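
For illustration, a minimal sketch of the field-splitting approach, assuming 
the HTML has already been parsed into plain-text parts (the field names and 
boost values here are made up):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class HtmlFieldBoost {
  // Index the parts of the page into separate fields.
  public static Document makeDoc(String title, String headings, String body) {
    Document doc = new Document();
    doc.add(Field.Text("title", title));        // text inside <title>
    doc.add(Field.Text("headings", headings));  // text inside <h1>, <b>, ...
    doc.add(Field.Text("body", body));          // everything else
    return doc;
  }

  // Search all three fields, weighting the tagged ones higher.
  public static BooleanQuery makeQuery(String word) {
    BooleanQuery q = new BooleanQuery();
    TermQuery title = new TermQuery(new Term("title", word));
    title.setBoost(4.0f);
    TermQuery headings = new TermQuery(new Term("headings", word));
    headings.setBoost(2.0f);
    q.add(title, false, false);     // all clauses optional: any may match
    q.add(headings, false, false);
    q.add(new TermQuery(new Term("body", word)), false, false);
    return q;
  }
}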

	Erik

On Jan 21, 2004, at 6:50 AM, Alexey Maksakov wrote:

Hello!

Is there any idea how to boost terms in HTML documents that are
surrounded by HTML tags such as <B>, <H1>, etc.?

Can it be done with the existing API, or is a reimplementation of
TokenStream with custom Token types needed?

Though it seems to me that even such a re-implementation won't help
without changing the indexing and searcher code... I hope that I'm wrong.

Thanks in advance.

Alexey.







AW: HTML tagged terms boosting...

2004-01-21 Thread Alexey Maksakov
Thanks for the answer.

Yes, I'm after field-specific boosting, but I'm also looking to create
short descriptions of the documents found, based on the query (as is done in
most search engines). I had thought about those solutions, but it seemed to me
that they are not straightforward and would cause trouble when building the
results' descriptions. On second thought, an answer was found - analyze the
document as a stream and put terms into separate fields (or create duplicates)
while maintaining the original offsets in the Token objects.

After that, building the description is quite simple - just use TermPositions
from IndexReader and then get the corresponding text portion(s) from the Field
(sadly it'll work only in the case of one body field - so only duplicates are
usable; several Fields, I think, would require an extra unindexed body Field to
fetch document pieces fast).

Hope I've not missed anything... Hm... not transparent, it is. :-) Just hope
it helps somebody else.

 Erik Hatcher [EMAIL PROTECTED]
 21.01.2004 15:27
 Please respond to Lucene Users List

 To: Lucene Users List [EMAIL PROTECTED]
 cc:
 Subject:Re: HTML tagged terms boosting...


 It definitely cannot be done with custom token types.  You're probably
 aiming for field-specific boosting, so you will need to parse the HTML
 into separate fields and use a multi-field search approach.

 I'm sure there are other tricks that could be used for boosting, like
 inserting the words inside <b> multiple times into the same field, for
 example.

  Erik


 On Jan 21, 2004, at 6:50 AM, Alexey Maksakov wrote:

  Hello!
 
  Is there any idea how to boost terms in HTML documents that are
  surrounded by HTML tags such as <B>, <H1>, etc.?
 
  Can it be done with the existing API, or is a reimplementation of
  TokenStream with custom Token types needed?
 
  Though it seems to me that even such a re-implementation won't help
  without changing the indexing and searcher code... I hope that I'm wrong.
 
  Thanks in advance.
 
  Alexey.
 
 
 
 







Re: Query Term Questions

2004-01-21 Thread Erik Hatcher
On Jan 20, 2004, at 10:22 AM, Terry Steichen wrote:
1) Is there a way to set the query boost factor depending not on the 
presence of a term, but on the presence of two specific terms?  For 
example, I may want to boost the relevance of a document that contains 
both "iraq" and "clerics", but not boost the relevance of documents 
that contain only one or the other term. (The idea is better 
discrimination than if I simply boosted both terms.)
But doesn't the query itself take this into account?  If there are 
multiple matching terms then the overlap (coord) factor kicks in.
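
If you want stronger discrimination for a particular query, one trick (just 
a sketch - the "contents" field name is hypothetical) is to add the 
conjunction itself as an extra optional, boosted clause, so only documents 
matching both terms collect the extra weight:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class PairBoost {
  public static BooleanQuery build() {
    BooleanQuery q = new BooleanQuery();
    q.add(new TermQuery(new Term("contents", "iraq")), false, false);     // optional
    q.add(new TermQuery(new Term("contents", "clerics")), false, false);  // optional

    BooleanQuery both = new BooleanQuery();
    both.add(new TermQuery(new Term("contents", "iraq")), true, false);   // required
    both.add(new TermQuery(new Term("contents", "clerics")), true, false);
    both.setBoost(5.0f);        // extra credit only when both terms match
    q.add(both, false, false);
    return q;
  }
}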

2) Is it possible to apply (or simulate) a negative query boost 
factor?  For example, I may have a complex query with lots of terms 
but want to reduce the relevance of a matching document that also 
includes the term "iowa". (The idea is an easier and more 
discriminating way than simply increasing the relevance of all other 
terms besides "iowa".)
Another reply mentioned negative boosting.  Is that not working as 
you'd like?

3) Is there a way to handle variants of a phrase without OR'ing 
together the variants?  For example, I may want to find documents 
dealing with North Korea; the terms might be "north korea" or "north 
korean" or "north koreans" - is there a way to handle this with a 
single term using wildcards?
Sounds like what you're really after is fancier analysis.  This is one 
of the purposes of analysis, to do stemming.

	Erik



Re: Query Term Questions

2004-01-21 Thread Terry Steichen
Erik,

Thanks for your response.  My specific comments (TS==) are inserted below.
I should make clear that I'm using
fairly complex, embedded queries - not ones that the user is expected to
enter.

Regards,

Terry

- Original Message -
From: Erik Hatcher [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, January 21, 2004 9:31 AM
Subject: Re: Query Term Questions


 On Jan 20, 2004, at 10:22 AM, Terry Steichen wrote:
  1) Is there a way to set the query boost factor depending not on the
  presence of a term, but on the presence of two specific terms?  For
  example, I may want to boost the relevance of a document that contains
  both "iraq" and "clerics", but not boost the relevance of documents
  that contain only one or the other term. (The idea is better
  discrimination than if I simply boosted both terms.)

 But doesn't the query itself take this into account?  If there are
 multiple matching terms then the overlap (coord) factor kicks in.

TS==Except that I'd like to be able to choose to do this on a
query-by-query basis.  In other words, it's desirable that some specific
queries significantly increase their discrimination based on this
multiple matching, relative to the normal extra boost given by the coord
factor.  However, I take it from your answer that there's not a way to do
this in the query itself (at least using the unmodified, standard Lucene
version).


  2) Is it possible to apply (or simulate) a negative query boost
  factor?  For example, I may have a complex query with lots of terms
  but want to reduce the relevance of a matching document that also
  includes the term "iowa". (The idea is an easier and more
  discriminating way than simply increasing the relevance of all other
  terms besides "iowa".)

 Another reply mentioned negative boosting.  Is that not working as
 you'd like?

TS==I've not been able to get negative boosting to work at all.  Maybe
there's a problem with my syntax.
If, for example, I do a search with "green beret"^10, it works just fine.
But "green beret"^-2 gives me a
ParseException showing a lexical error.


  3) Is there a way to handle variants of a phrase without OR'ing
  together the variants?  For example, I may want to find documents
  dealing with North Korea; the terms might be "north korea" or "north
  korean" or "north koreans" - is there a way to handle this with a
  single term using wildcards?

 Sounds like what you're really after is fancier analysis.  This is one
 of the purposes of analysis, to do stemming.

TS==Well, I hope I'm not trying to be fancy.  It's just that listing all
the different variants, particularly when (as in my case) I have to do
this for multiple fields, gets tedious and error-prone.  The example
above is simply one such case for a particular query - other queries may
have entirely different desired combinations.  Constructing a single
stemmer to handle all such cases would be (for me, at least) very
difficult.  Besides, I tend to stay away from stemming because I believe
it can introduce some rather unpredictable side-effects.





Re: Query Term Questions

2004-01-21 Thread Erik Hatcher
On Jan 21, 2004, at 10:01 AM, Terry Steichen wrote:
But doesn't the query itself take this into account?  If there are
multiple matching terms then the overlap (coord) factor kicks in.

TS==Except that I'd like to be able to choose to do this on a
query-by-query basis.  In other words, it's desirable that some specific
queries significantly increase their discrimination based on this
multiple matching, relative to the normal extra boost given by the coord
factor.  However, I take it from your answer that there's not a way to do
this in the query itself (at least using the unmodified, standard Lucene
version).
Don't interpret my replies as being absolute here - I'm still learning 
lots about Lucene and am open to being shown new ways of doing things 
with it.

Another reply mentioned negative boosting.  Is that not working as
you'd like?

TS==I've not been able to get negative boosting to work at all.  Maybe
there's a problem with my syntax.
If, for example, I do a search with "green beret"^10, it works just fine.
But "green beret"^-2 gives me a
ParseException showing a lexical error.
Have you tried it without using QueryParser and boosting a Query using 
setBoost on it?  QueryParser is a double-edged sword and it looks like 
it only allows numeric characters (plus . followed by numeric 
characters).  So QueryParser has the problem with negative boosts, but 
not Query itself.
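
For example, a small sketch of building the boosted query in code rather 
than through QueryParser (the "contents" field name is hypothetical):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;

public class BoostExample {
  public static Query demote() {
    PhraseQuery pq = new PhraseQuery();
    pq.add(new Term("contents", "green"));
    pq.add(new Term("contents", "beret"));
    // QueryParser rejects "^-2", but setBoost accepts any float
    pq.setBoost(0.1f);
    return pq;
  }
}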

Sounds like what you're really after is fancier analysis.  This is one
of the purposes of analysis, to do stemming.

TS==Well, I hope I'm not trying to be fancy.  It's just that listing all
the different variants, particularly when (as in my case) I have to do
this for multiple fields, gets tedious and error-prone.
The example above is simply one such case for a particular query - other
queries may have entirely different desired combinations.  Constructing
a single stemmer to handle all such cases would be (for me, at least)
very difficult.  Besides, I tend to stay away from stemming because I
believe it can introduce some rather unpredictable side-effects.
I'd still recommend trying some of the other analyzer options out there 
and seeing if you can tweak things to your liking.  This is really the 
answer for what you are after, I'm almost certain.  Good stemmers exist 
- look at the Porter one or the Snowball ones.  Write some test cases 
to analyze the analyzer like I did in my java.net articles - it 
really will let you experiment with indexing and searching easily.

	Erik



Re: Vector -> LinkedList for performance reasons...

2004-01-21 Thread Nicolas Toper
Hi,
I'd like to help work on improving Lucene. How can I help?
On Wednesday 21 January 2004 at 16:38, Doug Cutting wrote:
 Francesco Bellomi wrote:
  I agree that synchronization in Vector is a waste of time if it isn't
  required,

 It would be interesting to see if such synchronization actually impairs
 overall performance significantly.  This would be fairly simple to test.

  but I'm not sure if LinkedList is a better (faster) choice than
  ArrayList.

 Correct.  ArrayList is the substitute for Vector.  One could also try
 replacing Hashtable with HashMap in many places.

   I think only a profiler could tell.

 I wouldn't trust a profiler for this.  Rather, benchmarks performed
 before and after the change will best show real performance.  A
 substantial indexing benchmark and some search benchmarks, searching
 fairly large indexes, would be good.

 My hunch is that the speedup will not be significant.  Synchronization
 costs in modern JVMs are very small when there is no contention.  But
 only measurement can say for sure.

 Doug
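
A very rough sketch of such a test - single run, no warmup, so treat any 
numbers from it with suspicion:

import java.util.ArrayList;
import java.util.List;
import java.util.Vector;

public class VectorVsArrayList {
  public static void main(String[] args) {
    int n = 5 * 1000 * 1000;
    long t0 = System.currentTimeMillis();
    List v = new Vector();
    for (int i = 0; i < n; i++) v.add(Boolean.TRUE);   // synchronized adds
    long t1 = System.currentTimeMillis();
    List a = new ArrayList();
    for (int i = 0; i < n; i++) a.add(Boolean.TRUE);   // unsynchronized adds
    long t2 = System.currentTimeMillis();
    System.out.println("Vector: " + (t1 - t0) + " ms, ArrayList: " + (t2 - t1) + " ms");
  }
}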







Re: Query Term Questions

2004-01-21 Thread Morus Walter
Erik Hatcher writes:
 
  TS==I've not been able to get negative boosting to work at all.  Maybe
  there's a problem with my syntax.
   If, for example, I do a search with "green beret"^10, it works just
   fine.
   But "green beret"^-2 gives me a
   ParseException showing a lexical error.
 
 Have you tried it without using QueryParser and boosting a Query using 
 setBoost on it?  QueryParser is a double-edged sword and it looks like 
 it only allows numeric characters (plus . followed by numeric 
 characters).  So QueryParser has the problem with negative boosts, but 
 not Query itself.

He said he wants to have one term less important than others (at least
that's what I understood).
That's done by positive boost factors smaller than 1.0 (e.g. 0.5 or 0.1),
which might be called 'negative boosting' (just as braking is a form of
negative acceleration).

If you used negative boost factors you would actually decrease the score of
a match (not just increase it less) and risk ending up with a negative
score. I don't think that would be a good idea.

Morus




Re: QueryParser and stopwords

2004-01-21 Thread Otis Gospodnetic
Hello Morus,

--- Morus Walter [EMAIL PROTECTED] wrote:
 Hi,
 
 I'm currently trying to get rid of query parser problems with
 stopwords
 (depending on the query, there are ArrayIndexOutOfBoundsExceptions,
 e.g. for "stop AND nonstop", where "stop" is a stopword and "nonstop" is not).
 
 While this isn't hard to fix (I'll enter a bug and patch in
 bugzilla), 

There is already a bug report open for this.  A very old one, too!

 there's one issue left, I'm not sure how to deal with:

 What should the query parser return for a query string containing
 only stopwords?

null?

 And when I think about this, there's another one:
 "stop AND NOT nonstop"
 creates a boolean query containing only prohibited terms, which
 AFAIK cannot be used in a search. How to deal with this?
 
 Currently it returns an empty BooleanQuery.
 I think it would be more useful to return null in this case.

Either one should be okay.  null, to be consistent with above.

Looking forward to the patch for this OLD bug.
Otis
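
A sketch of what a defensive caller could then look like, assuming the
parser is patched to return null for all-stopword input (the "contents"
field name is made up, and this assumes BooleanQuery.getClauses() is
available):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;

public class SafeSearch {
  // Returns null if the query reduced to nothing (e.g. only stopwords).
  public static Hits search(Searcher searcher, String input, Analyzer analyzer)
      throws Exception {
    Query q = QueryParser.parse(input, "contents", analyzer);
    if (q == null) return null;
    if (q instanceof BooleanQuery
        && ((BooleanQuery) q).getClauses().length == 0) return null;
    return searcher.search(q);
  }
}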






Re: setMaxClauseCount ??

2004-01-21 Thread Karl Koch
Hello Doug,

that sounds interesting to me. I am referring to a paper written by NIST about
Relevance Feedback, which did tests with 20-200 words. This is why I
thought it might be good to be able to use all non-stopwords of a document for that
and see what happens. Do you know of good papers about strategies for how
to select keywords effectively, beyond the scope of stopword lists and stemming?

Using term frequencies of the document is not really possible, since Lucene
does not provide access to a document vector, does it?

By the way, could you send me Dmitry's code for the Vector extension?
I have been asking in another thread but have not gotten it so far. I really
would like to have a look... Also it would be nice to know the status
of integrating it into Lucene 1.3. Who is working on it,
and how could I contribute?

Cheers,
Karl


 Andrzej Bialecki wrote:
  Karl Koch wrote:
  I actually wanted to add a large amount of text from an existing
  document to
  find a closely related one. Can you suggest another good way of doing
  this?
 
  You should try to reduce the dimensionality by reducing the number of 
  unique features. In this case, you could for example use only keywords 
  (or key phrases) instead of the full content of documents.
 
 Indeed, this is a good approach.  In my experience, six or eight terms 
 are usually enough, and they needn't all be required.
 
 Doug
 
 
 






Re: setMaxClauseCount ??

2004-01-21 Thread Otis Gospodnetic
Karl:
http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]msgId=114748

Status: several people have mentioned they wanted to work on it, but
nobody has contributed any patches.  The code you see at the above URL
is not compatible with Lucene 1.3, but could be brought up to date.

Otis

--- Karl Koch [EMAIL PROTECTED] wrote:
 Hello Doug,
 
 that sounds interesting to me. I am referring to a paper written by NIST
 about Relevance Feedback, which did tests with 20-200 words. This is why
 I thought it might be good to be able to use all non-stopwords of a
 document for that and see what happens. Do you know of good papers about
 strategies for how to select keywords effectively, beyond the scope of
 stopword lists and stemming?
 
 Using term frequencies of the document is not really possible, since
 Lucene does not provide access to a document vector, does it?
 
 By the way, could you send me Dmitry's code for the Vector extension?
 I have been asking in another thread but have not gotten it so far. I
 really would like to have a look... Also it would be nice to know the
 status of integrating it into Lucene 1.3. Who is working on it, and how
 could I contribute?
 
 Cheers,
 Karl
 
 
  Andrzej Bialecki wrote:
   Karl Koch wrote:
   I actually wanted to add a large amount of text from an existing
   document to find a closely related one. Can you suggest another good
   way of doing this?
  
   You should try to reduce the dimensionality by reducing the number of
   unique features. In this case, you could for example use only keywords
   (or key phrases) instead of the full content of documents.
  
  Indeed, this is a good approach.  In my experience, six or eight terms
  are usually enough, and they needn't all be required.
  
  Doug
  
  
 
 






RE: setMaxClauseCount ??

2004-01-21 Thread Chong, Herb
There are just about as many ways of doing it as there are papers that talk about
automatic relevance feedback. Many require domain-specific reference documents that
are full of facts and therefore good sources of related words. Some people use
WordNet. Some of these techniques can add 400-500 terms to a query if they are
searching long documents and using reference documents that are equally long. The
technique is very important only when searching long documents, and almost irrelevant
for very short ones.

Herb

-Original Message-
From: Karl Koch [mailto:[EMAIL PROTECTED]
Sent: Wednesday, January 21, 2004 11:09 AM
To: Lucene Users List
Subject: Re: setMaxClauseCount ??

that sounds interesting to me. I am referring to a paper written by NIST about
Relevance Feedback, which did tests with 20-200 words. This is why I
thought it might be good to be able to use all non-stopwords of a document for that
and see what happens. Do you know of good papers about strategies for how
to select keywords effectively, beyond the scope of stopword lists and stemming?




Re: setMaxClauseCount ??

2004-01-21 Thread Doug Cutting
Karl Koch wrote:
Do you know of good papers about strategies for how
to select keywords effectively, beyond the scope of stopword lists and stemming?
Using term frequencies of the document is not really possible, since Lucene
does not provide access to a document vector, does it?
Lucene does let you access the document frequency of terms, with 
IndexReader.docFreq().  Term frequencies can be computed by 
re-tokenizing the text, which, for a single document, is usually fast 
enough.  But looking up the docFreq() of every term in the document is 
probably too slow.

You can use some heuristics to prune the set of terms, to avoid calling 
docFreq() too much, or at all.  Since you're trying to maximize a tf*idf 
score, you're probably most interested in terms with a high tf. 
Choosing a tf threshold even as low as two or three will radically 
reduce the number of terms under consideration.  Another heuristic is 
that terms with a high idf (i.e., a low df) tend to be longer.  So you 
could threshold the terms by the number of characters, not selecting 
anything less than, e.g., six or seven characters.  With these sorts of 
heuristics you can usually find a small set of, e.g., ten or fewer terms 
that do a pretty good job of characterizing a document.

It all depends on what you're trying to do.  If you're trying to eke out 
that last percent of precision and recall regardless of computational 
difficulty so that you can win a TREC competition, then the techniques I 
mention above are useless.  But if you're trying to provide a "more like 
this" button on a search results page that does a decent job and has 
good performance, such techniques might be useful.

An efficient, effective "more-like-this" query generator would be a 
great contribution, if anyone's interested.  I'd imagine that it would 
take a Reader or a String (the document's text) and an Analyzer, and 
return a set of representative terms using heuristics like those above.  
The frequency and length thresholds could be parameters, etc.
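
As a starting point, here is a rough sketch of such a selector using the
heuristics above; the thresholds and the field name passed to the analyzer
are illustrative only:

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class RepresentativeTerms {
  // Re-tokenize the text and keep terms whose in-document frequency
  // and length pass the thresholds (e.g. minTf=2, minLen=7).
  public static String[] select(String text, Analyzer analyzer,
                                int minTf, int minLen) throws IOException {
    Map tf = new HashMap();
    TokenStream ts = analyzer.tokenStream("contents", new StringReader(text));
    for (Token t = ts.next(); t != null; t = ts.next()) {
      String word = t.termText();
      Integer old = (Integer) tf.get(word);
      tf.put(word, new Integer(old == null ? 1 : old.intValue() + 1));
    }
    ts.close();
    List out = new ArrayList();
    for (Iterator it = tf.entrySet().iterator(); it.hasNext();) {
      Map.Entry e = (Map.Entry) it.next();
      int count = ((Integer) e.getValue()).intValue();
      if (count >= minTf && ((String) e.getKey()).length() >= minLen) {
        out.add(e.getKey());
      }
    }
    return (String[]) out.toArray(new String[out.size()]);
  }
}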

Doug



Re: Query Term Questions

2004-01-21 Thread Terry Steichen
Morus,

Unfortunately, using positive boost factors less than 1 causes the parser to
barf the same as do negative boost factors.

Regards,

Terry

- Original Message -
From: Morus Walter [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, January 21, 2004 10:54 AM
Subject: Re: Query Term Questions


 Erik Hatcher writes:
  
   TS==I've not been able to get negative boosting to work at all.  Maybe
   there's a problem with my syntax.
   If, for example, I do a search with "green beret"^10, it works just
   fine.
   But "green beret"^-2 gives me a
   ParseException showing a lexical error.
 
  Have you tried it without using QueryParser and boosting a Query using
  setBoost on it?  QueryParser is a double-edged sword and it looks like
  it only allows numeric characters (plus . followed by numeric
  characters).  So QueryParser has the problem with negative boosts, but
  not Query itself.

 He said he wants to have one term less important than others (at least
 that's what I understood).
 That's done by positive boost factors smaller than 1.0 (e.g. 0.5 or 0.1),
 which might be called 'negative boosting' (just as braking is a form of
 negative acceleration).

 If you used negative boost factors you would actually decrease the score of
 a match (not just increase it less) and risk ending up with a negative
 score. I don't think that would be a good idea.

 Morus








Re: Query Term Questions

2004-01-21 Thread Erik Hatcher
On Jan 21, 2004, at 4:21 PM, Terry Steichen wrote:
PS: Is this in the docs?  If not, maybe it should be mentioned.
Depends on what you consider the docs.  I looked at QueryParser.jj to 
see what it parses.

Also, on http://jakarta.apache.org/lucene/docs/queryparsersyntax.html 
it has an example of 0.2.

Documentation patches gladly accepted :))

	Erik



Re: Vector -> LinkedList for performance reasons...

2004-01-21 Thread Tatu Saloranta
On Wednesday 21 January 2004 08:38, Doug Cutting wrote:
 Francesco Bellomi wrote:
  I agree that synchronization in Vector is a waste of time if it isn't
  required,

 It would be interesting to see if such synchronization actually impairs
 overall performance significantly.  This would be fairly simple to test.

True. At the same time, it's questionable whether there's any benefit to not 
changing it to ArrayList. However:


  but I'm not sure if LinkedList is a better (faster) choice than
  ArrayList.

 Correct.  ArrayList is the substitute for Vector.  One could also try
 replacing Hashtable with HashMap in many places.

Yes, LinkedList is pretty much never more efficient, or even as efficient 
(either memory- or performance-wise), than ArrayList. The array copy needed 
when doubling the size (which happens seldom enough as the list grows) is 
negligible compared to the increased GC activity and memory usage for the 
entries in a LinkedList (object overhead of 24 bytes for each entry, 
alloc/GC).
And obviously indexed access is hideously slow, if that's needed. I've yet to 
find any use for LinkedList; it'd make sense to have some sort of combination 
(segmented array list, i.e. a linked list of arrays) for huge arrays... but 
LinkedList just isn't useful even there.

...
 My hunch is that the speedup will not be significant.  Synchronization
 costs in modern JVMs are very small when there is no contention.  But
 only measurement can say for sure.

Apparently 1.4 specifically had significant improvements there, reducing the 
cost of synchronization.

-+ Tatu +-


 Doug







1.3-final: now giving me java.io.FileNotFoundException (Too many open files)

2004-01-21 Thread Matt Quail
I'm getting the following stack trace from lucene-1.3-final running on 
JDK 1.4.2_03-b02 on Linux:

java.io.FileNotFoundException: /home/matt/blah/idx/_123n.tis (Too many open files)
        at java.io.RandomAccessFile.open(Native Method)
        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:204)
        at org.apache.lucene.store.FSInputStream$Descriptor.<init>(FSDirectory.java:389)
        at org.apache.lucene.store.FSInputStream.<init>(FSDirectory.java:418)
        at org.apache.lucene.store.FSDirectory.openFile(FSDirectory.java:291)
        at org.apache.lucene.index.TermInfosReader.<init>(TermInfosReader.java:79)
        at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:141)
        at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:423)
        at org.apache.lucene.index.IndexWriter.maybeMergeSegments(IndexWriter.java:401)
        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:260)
        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:244)
        at com.foo.Foo.perform(Foo.java:53)
I've only just upgraded to 1.3-final from 1.3-RC2, and now I've started 
seeing this error. I'll try to trace it down further and see whether it is 
me leaking file handles, and not Lucene.

Any chance this is a Lucene bug?

=Matt
