RE: Limiting Hits with a score threshold

2005-02-14 Thread Chuck Williams
I would not recommend doing this because absolute score values in Lucene
are not meaningful (e.g., scores are not directly comparable across
searches).  The ratio of a score to the highest score returned is
meaningful, but there is no absolute calibration for the highest score
returned, at least at present, so there is not a way to determine from
the scores what the quality of the result set is overall.  There are
various approaches to improving this that have been discussed: making
the scores more directly comparable by encoding additional information
into the score and using that for normalization, or, probably better,
generalizing the score to an object that contains multiple pieces of
information (e.g., the total number of query terms matched by the top
result, which would be quite useful if you are using default OR).  None
of these ideas are implemented yet as far as I know.
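Since only the ratio of a hit's score to the top score is meaningful, one common workaround is to rescale scores relative to the best hit before displaying them. A minimal sketch in plain Java; note this is a relative display aid only, not the absolute quality measure asked about:

```java
public class ScoreRatios {
    // Rescale raw hit scores so the top hit becomes 1.0.  Only these
    // ratios are comparable; the absolute Lucene scores are not.
    public static float[] normalize(float[] rawScores) {
        float[] out = new float[rawScores.length];
        if (rawScores.length == 0) {
            return out;
        }
        float top = rawScores[0];  // Hits are returned best-first
        for (int i = 0; i < rawScores.length; i++) {
            out[i] = top > 0.0f ? rawScores[i] / top : 0.0f;
        }
        return out;
    }
}
```

A threshold such as "ratio > 0.5" is then at least self-consistent within one search, even though it still says nothing about result quality across searches.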

Chuck

   -Original Message-
   From: Jay Hill [mailto:[EMAIL PROTECTED]
   Sent: Monday, February 14, 2005 11:08 AM
   To: lucene-user@jakarta.apache.org
   Subject: Limiting Hits with a score threshold
   
   Does anyone have an example of limiting results returned based on a
   score threshold? For example if I'm only interested in documents
   with a score > 0.05.
   
   Thanks,
   -Jay
   
  
-
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]





RE: Similarity coord,lengthNorm

2005-02-07 Thread Chuck Williams
Hi Michael,

I'd suggest first using the explain() mechanism to figure out what's
going on.  Besides lengthNorm(), another factor that is likely skewing
your results in my experience is idf(), which Lucene typically makes
very large by squaring the intrinsic value.  I've found it helpful to
flatten lengthNorm(), tf() and idf() relative to what is used in
DefaultSimilarity.  There is a comparative evaluation of Similarity
implementations going on now.  You might consider looking at these:

Bug 32674 has a WikipediaSimilarity posted that you might want to try.
You might want to flatten lengthNorm() even further (e.g. all the way to
1.0), but I'd suggest trying it as is first.  If you try it, please post
your assessment.  Here's the link:
http://issues.apache.org/bugzilla/show_bug.cgi?id=32674

You also might find it interesting to read the thread entitled RE:
Scoring benchmark evaluation.  Was RE: How to proceed with Bug 31841 -
MultiSearcher problems with Similarity.docFreq() ? on lucene-dev, as
this contains a discussion of many of the issues.
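A flattened Similarity of the kind described might be sketched as follows (Lucene 1.4-era Similarity API; the damping choices are illustrative assumptions, not tuned values, and reindexing is required for lengthNorm changes to take effect):

```java
import org.apache.lucene.search.DefaultSimilarity;

// Sketch: damp lengthNorm, tf and idf relative to DefaultSimilarity
// so that field length and single rare terms don't dominate scores.
public class FlattenedSimilarity extends DefaultSimilarity {
    public float lengthNorm(String fieldName, int numTerms) {
        return 1.0f;  // ignore field length entirely
    }

    public float idf(int docFreq, int numDocs) {
        // The idf factor is effectively squared in the final score;
        // taking the square root here flattens its influence.
        return (float) Math.sqrt(super.idf(docFreq, numDocs));
    }

    public float tf(float freq) {
        // Damp tf more strongly than the default sqrt(freq).
        return (float) (1.0 + Math.log(1.0 + freq));
    }
}
```

Install it with IndexWriter.setSimilarity() and IndexSearcher.setSimilarity() (both must agree), then reindex before comparing results.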

Good luck,

Chuck

   -Original Message-
   From: Erik Hatcher [mailto:[EMAIL PROTECTED]
   Sent: Monday, February 07, 2005 6:51 AM
   To: Lucene Users List
   Subject: Re: Similarity coord,lengthNorm
   
   
   On Feb 7, 2005, at 8:53 AM, Michael Celona wrote:
Would fixing the lengthNorm to 1 fix this problem?
   
   Yes, it would eliminate the length of a field as a factor.
   
   Your best bet is to set up a test harness where you can try out
   various tweaks to Similarity, but setting the length normalization
   factor to 1.0 may be all you need to do, as the coord() takes care
   of the other factor you're after.
   
   Erik
   
   
Michael
   
-Original Message-
From: Michael Celona [mailto:[EMAIL PROTECTED]
Sent: Monday, February 07, 2005 8:48 AM
To: Lucene Users List
Subject: Similarity coord,lengthNorm
   
I have varying length text fields which I am searching on.  I would
like relevancy to be dictated predominantly by the number of terms in
my query that match.  Right now I am seeing a high relevancy for a
single word matching in a small document even though all the terms in
my query don't match.  Does anyone have an example of a custom
Similarity subclass which overrides the coord and lengthNorm methods?
   
   
   
Thanks..
   
Michael
   
   
   
   
   



RE: which HTML parser is better?

2005-02-01 Thread Chuck Williams
I think that depends on what you want to do.  The Lucene demo parser does 
simple mapping of HTML files into Lucene Documents; it does not give you a 
parse tree for the HTML doc.  CyberNeko is an extension of Xerces (uses the 
same API; will likely become part of Xerces), and so maps an HTML document into 
a full DOM that you can manipulate easily for a wide range of purposes.  I 
haven't used JTidy at an API level and so don't know it as well -- based on its 
UI, it appears to be focused primarily on HTML validation and error 
detection/correction.

I use CyberNeko for a range of operations on HTML documents that go beyond 
indexing them in Lucene, and really like it.  It has been robust for me so far.
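A sketch of how an HTML file can be mapped to a Lucene Document via CyberNeko's DOM parser (the field names and the recursive text walk are illustrative; the parser class is NekoHTML's, which extends the Xerces DOMParser):

```java
import java.io.FileInputStream;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class NekoIndexer {
    // Parse an HTML file into a DOM, flatten its text nodes, and
    // wrap the result in a Lucene Document (1.4-era Field API).
    public static Document htmlToDocument(String path) throws Exception {
        DOMParser parser = new DOMParser();
        parser.parse(new InputSource(new FileInputStream(path)));
        StringBuffer text = new StringBuffer();
        collectText(parser.getDocument(), text);
        Document doc = new Document();
        doc.add(Field.Keyword("path", path));
        doc.add(Field.Text("contents", text.toString()));
        return doc;
    }

    // Recursively gather the text nodes of the DOM tree.
    private static void collectText(Node node, StringBuffer out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.append(node.getNodeValue()).append(' ');
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collectText(c, out);
        }
    }
}
```

Because NekoHTML repairs malformed markup while building the DOM, the same walk works on real-world pages that would break a strict XML parser.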

Chuck

   -Original Message-
   From: Jingkang Zhang [mailto:[EMAIL PROTECTED]
   Sent: Tuesday, February 01, 2005 1:15 AM
   To: lucene-user@jakarta.apache.org
   Subject: which HTML parser is better?
   
   Three HTML parsers (the Lucene web application demo, CyberNeko HTML
   Parser, JTidy) are mentioned in Lucene FAQ 1.3.27. Which is the
   best? Can it filter tags that are auto-created by MS Word's "Save
   As HTML" function?
   
   



RE: Newbie: Human Readable Stemming, Lucene Architecture, etc!

2005-01-20 Thread Chuck Williams
Like any other field, A.I. is only elusive until you master it.  There
are plenty of companies using A.I. techniques in various IR applications
successfully. LSI in particular has been around a long time and is well
understood.

Chuck

   -Original Message-
   From: jian chen [mailto:[EMAIL PROTECTED]
   Sent: Thursday, January 20, 2005 2:10 PM
   To: Lucene Users List
   Subject: Re: Newbie: Human Readable Stemming, Lucene Architecture,
etc!
   
   Hi,
   
   One thing to point out. I think Lucene is not using LSI as the
   underlying retrieval model. It uses vector space model and also
   proximity based retrieval.
   
   Personally, I don't know much about LSI and I don't think the fancy
   stuff like LSI is workable in industry. I believe we are far away
   from the era of artificial intelligence and using any elusive way
   to do information retrieval.
   
   Cheers,
   
   Jian
   
   
   On Thu, 20 Jan 2005 14:50:10 -0700, Owen Densmore
   [EMAIL PROTECTED] wrote:
 Hi .. I'm new to the list so forgive a dumb question or two as I get
 started.

 We're in the midst of converting a small collection (1200-1500
 currently) of scientific literature to be easily searchable/navigable.
 We'll likely provide both a text query interface as well as a
 graphical way to search and discover.

 Our initial approach will be vector based, looking at Latent Semantic
 Indexing (LSI) as a potential tool, although if that's not needed,
 we'll stop at reasonably simple stemming with a weighted document term
 matrix (DTM).  (Bear in mind I couldn't even pronounce most of these
 concepts last week, so go easy if I'm incoherent!)

 It looks to me that Lucene has a quite well factored architecture.  I
 should at the very least be able to use the analyzer and stemmer to
 create a good starting point in the project.  I'd also like to leave a
 nice architecture behind in case we or others end up experimenting
 with, or extending, the system.

 So a couple of questions:

 1 - I'm a bit concerned that reasonable stemming (Porter/Snowball)
 apparently produces non-word stems .. i.e. not really human readable.
 (Example: generate, generates, generated, generating -> generat)
 Although in typical queries this is not important because the result
 of the search is a document list, it *would* be important if we use
 the stems within a graphical navigation interface.
     So the question is: Is there a way to have the stemmer produce
 english base forms of the words being stemmed?

 2 - We're probably using Lucene in ways it was not designed for, such
 as DTM/LSI and graphical clustering and navigation.  Naturally we'll
 provide code for these parts that are not in Lucene.
     But the question arises: is this kinda dumb?!  Has anyone
 stretched Lucene's design center with positive results?  Are we
 barking up the wrong tree?

 3 - A nit on hyphenation: Our collection is scientific so has many
 hyphenated words.  I'm wondering about your experiences with
 hyphenation.  In our collection, things like self-organization,
 power-law, space-time, small-world, agent-based, etc. occur often,
 for example.
     So the question is: Do folks break up hyphenated words?  If not,
 do you stem the parts and glue them back together?  Do you apply
 stoplists to the parts?

 Thanks for any help and pointers you can fling along,

 Owen    http://backspaces.net/    http://redfish.com/
   
   



RE: QUERYPARSIN BOOSTING

2005-01-12 Thread Chuck Williams
Google has natural results on the left and sponsored results on the
right.  I do not believe the natural results are affected by paid
keywords at all.  What you seem to be describing is the behavior of the
sponsored results, which I believe are explicitly attached to certain
keywords.

The same approach would work in Lucene.  Create a field to hold
purchased keywords (any keywords you want to associate with the
result).  Then you can include this field in your search with a high
boost (see DistributingMultiFieldQueryParser,
http://issues.apache.org/bugzilla/show_bug.cgi?id=32674).

Google prefers certain results over others for certain keywords based on
various factors of the keyword purchase and the site (amount paid for
the keyword, Page Rank of the site, tenure of the listing, popularity of
the listing, etc.).  You could emulate this in various ways, using a
combination of document/field boosting and perhaps replication of the
term in the field (to increase its tf), or even perhaps multiple fields
that are boosted at different levels.  I'm not sure of the best approach
to this part -- you could experiment a little.
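A sketch of the boosted purchased-keywords clause (Lucene 1.4-era API; the field name paidKeywords and the boost of 10 are hypothetical illustrations):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class PaidBoost {
    // Build: contents:term OR paidKeywords:term^10.  Because both
    // clauses are optional (required=false, prohibited=false),
    // documents whose purchased-keyword field matches float to the
    // top without being required to match it.
    public static Query build(String term) {
        BooleanQuery q = new BooleanQuery();
        q.add(new TermQuery(new Term("contents", term)), false, false);
        TermQuery paid = new TermQuery(new Term("paidKeywords", term));
        paid.setBoost(10.0f);
        q.add(paid, false, false);
        return q;
    }
}
```

Tuning the boost (or replicating the term in the field to raise its tf) is then the per-listing knob described above.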

Chuck

   -Original Message-
   From: Karthik N S [mailto:[EMAIL PROTECTED]
   Sent: Wednesday, January 12, 2005 2:30 AM
   To: Lucene Users List
   Subject: RE: QUERYPARSIN  BOOSTING
   
   Hi Guys

   Apologies...

   If somebody has been closely watching GOOGLE, it boosts WEBSITES
   for paid category sites based on search words.

   Can this [boost the full WEBSITE] be achieved in Lucene's search
   based on searchword?

   If so, please explain / give examples ???
   
   with regards
   karthik
   
   
   
   -Original Message-
   From: Chuck Williams [mailto:[EMAIL PROTECTED]
   Sent: Tuesday, January 11, 2005 2:00 PM
   To: Lucene Users List; [EMAIL PROTECTED]
   Subject: RE: QUERYPARSIN  BOOSTING
   
   
   Karthik,
   
   I don't think the boost in your example does much since you are
   using an AND query, i.e. all hits will have to contain both
   vendor:nike and contents:shoes.  If you used an OR, then the boost
   would put nike products above (non-nike) shoes, unless there was
   some other factor that causes the score of contents:shoes to be 10x
   greater than that of vendor:nike.  It's a good idea to look at the
   results of explain() when analyzing what's happening with scoring,
   tuning your boosts and your Similarity.
   
   Chuck
   
  -Original Message-
  From: Nader Henein [mailto:[EMAIL PROTECTED]
  Sent: Tuesday, January 11, 2005 12:21 AM
  To: Lucene Users List
  Subject: Re: QUERYPARSIN  BOOSTING
 
   From the text on the Lucene Jakarta Site:
  http://jakarta.apache.org/lucene/docs/queryparsersyntax.html

  Lucene provides the relevance level of matching documents based on
  the terms found. To boost a term use the caret, ^, symbol with a
  boost factor (a number) at the end of the term you are searching.
  The higher the boost factor, the more relevant the term will be.

  Boosting allows you to control the relevance of a document by
  boosting its term. For example, if you are searching for

      jakarta apache

  and you want the term "jakarta" to be more relevant, boost it using
  the ^ symbol along with the boost factor next to the term. You
  would type:

      jakarta^4 apache

  This will make documents with the term jakarta appear more
  relevant. You can also boost Phrase Terms as in the example:

      "jakarta apache"^4 "jakarta lucene"

  By default, the boost factor is 1. Although the boost factor must
  be positive, it can be less than 1 (e.g. 0.2).
 
 
  Regards.
 
  Nader Henein
 
 
  Karthik N S wrote:
 
  Hi Guys

  Apologies...

  This question may have been asked a million times on this forum;
  need some clarifications.

  1) FieldType = keyword, name = vendor

  2) FieldType = text, name = contents

  Question:

  1) How to construct a query which would allow hits available for
  the VENDOR to appear first?

  2) If boosting is to be applied, how to?

  3) Is the query constructed below correct?

  +Contents:shoes +((vendor:nike)^10)

  Please advise.
  Thx in advance.

  WITH WARM REGARDS
  HAVE A NICE DAY
  [ N.S.KARTHIK]
  
  
  
 
  

RE: QUERYPARSIN BOOSTING

2005-01-11 Thread Chuck Williams
Karthik,

I don't think the boost in your example does much since you are using an
AND query, i.e. all hits will have to contain both vendor:nike and
contents:shoes.  If you used an OR, then the boost would put nike
products above (non-nike) shoes, unless there was some other factor that
causes the score of contents:shoes to be 10x greater than that of
vendor:nike.  It's a good idea to look at the results of explain() when
analyzing what's happening with scoring, tuning your boosts and your
Similarity.
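A minimal way to inspect that, assuming an IndexSearcher and Query are already in hand, is a sketch like:

```java
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class ExplainDump {
    // Print each top hit's score together with Lucene's full scoring
    // breakdown (tf, idf, norms, boosts, coord) from explain().
    public static void dump(IndexSearcher searcher, Query query) throws Exception {
        Hits hits = searcher.search(query);
        int n = Math.min(10, hits.length());
        for (int i = 0; i < n; i++) {
            Explanation exp = searcher.explain(query, hits.id(i));
            System.out.println(hits.score(i) + " : " + exp.toString());
        }
    }
}
```

Reading the breakdown for one nike hit and one non-nike hit side by side usually shows immediately which factor is swamping the boost.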

Chuck

   -Original Message-
   From: Nader Henein [mailto:[EMAIL PROTECTED]
   Sent: Tuesday, January 11, 2005 12:21 AM
   To: Lucene Users List
   Subject: Re: QUERYPARSIN  BOOSTING
   
 From the text on the Lucene Jakarta Site:
   http://jakarta.apache.org/lucene/docs/queryparsersyntax.html

   Lucene provides the relevance level of matching documents based on
   the terms found. To boost a term use the caret, ^, symbol with a
   boost factor (a number) at the end of the term you are searching.
   The higher the boost factor, the more relevant the term will be.

   Boosting allows you to control the relevance of a document by
   boosting its term. For example, if you are searching for

       jakarta apache

   and you want the term "jakarta" to be more relevant, boost it using
   the ^ symbol along with the boost factor next to the term. You
   would type:

       jakarta^4 apache

   This will make documents with the term jakarta appear more
   relevant. You can also boost Phrase Terms as in the example:

       "jakarta apache"^4 "jakarta lucene"

   By default, the boost factor is 1. Although the boost factor must
   be positive, it can be less than 1 (e.g. 0.2).
   
   
   Regards.
   
   Nader Henein
   
   
   Karthik N S wrote:
   
   Hi Guys

   Apologies...

   This question may have been asked a million times on this forum;
   need some clarifications.

   1) FieldType = keyword, name = vendor

   2) FieldType = text, name = contents

   Question:

   1) How to construct a query which would allow hits available for
   the VENDOR to appear first?

   2) If boosting is to be applied, how to?

   3) Is the query constructed below correct?

   +Contents:shoes +((vendor:nike)^10)

   Please advise.
   Thx in advance.

   WITH WARM REGARDS
   HAVE A NICE DAY
   [ N.S.KARTHIK]
   
   
   
  



RE: SQL Distinct sintax in Lucen

2005-01-11 Thread Chuck Williams
If I understand what you are trying to do, you don't have a problem.
You can OR to your heart's content and Lucene will properly create the
union of the results.  I.e., there will be no duplicates.

There is built-in support for this kind of thing.  See
MultiFieldQueryParser, and for better results, consider
http://issues.apache.org/bugzilla/show_bug.cgi?id=32674.
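A sketch of the built-in support, assuming hypothetical field names title and contents (Lucene 1.4-era API):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class TwoFieldSearch {
    // Search the same string over two fields.  Each matching document
    // appears exactly once in Hits no matter how many fields matched,
    // so there is nothing to "distinct" away.
    public static Hits search(IndexSearcher searcher, String text) throws Exception {
        String[] fields = { "title", "contents" };
        Query q = MultiFieldQueryParser.parse(text, fields, new StandardAnalyzer());
        return searcher.search(q);
    }
}
```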

Chuck

   -Original Message-
   From: Carlos Franco Robles [mailto:[EMAIL PROTECTED]
   Sent: Tuesday, January 11, 2005 2:05 PM
   To: lucene-user@jakarta.apache.org
   Subject: SQL Distinct sintax in Lucen
   
   Hi all.
   
   I'm starting to use Lucene and I wonder if it is possible to write
   a query to ask for one string which can be in two different fields
   and filter duplicated results, like with DISTINCT in SQL syntax.
   Something like:

   distinct (+string OR OtherField:(+string))
   
   Thanks a lot
   
   




RE: Parsing issue

2005-01-04 Thread Chuck Williams
I use it and have yet to have a problem with it.  It uses the Xerces API
so you parse and access HTML files just like XML files.  Very cool.

Chuck

   -Original Message-
   From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
   Sent: Tuesday, January 04, 2005 2:05 PM
   To: Lucene Users List
   Subject: Re: Parsing issue
   
   That's the correct place to look and it includes code samples.
   Yes, it's a Jar file that you add to the CLASSPATH and use ... hm,
   normally programmatically, yes :).
   
   Otis
   
   --- Hetan Shah [EMAIL PROTECTED] wrote:
   
 Has anyone used NekoHTML?  If so, how do I use it?  Is it a
 stand-alone jar file that I include in my classpath and start using
 just like IndexHTML?
 Can someone share syntax and/or code if it is supposed to be used
 programmatically.  I am looking at
 http://www.apache.org/~andyc/neko/doc/html/ for more information; is
 that the correct place to look?
   
Thanks,
-H
   
   
Erik Hatcher wrote:
   
  Sure... clean up your HTML and it'll parse fine :)   Perhaps use
  JTidy to clean up the HTML.  Or switch to using a more forgiving
  parser like NekoHTML.

 Erik

 On Jan 4, 2005, at 3:59 PM, Hetan Shah wrote:

 Hello All,

 Does any one know how to handle the following parsing error?

 thanks for pointers/code snippets.

 -H

 While trying to parse a HTML file using IndexHTML I get

 Parse Aborted: Encountered \ at line 8, column 1162.
 Was expecting one of:
 ArgName ...
 = ...
 TagEnd ...




   



RE: Asking Questions in a Search

2004-12-28 Thread Chuck Williams
Verity acquired Native Minds -- Verity Response appears to be that
technology.  It is not search technology at all -- rather it is a
programmed question-answer script knowledge base.  IMO, there are much
better commercial solutions to this problem; e.g., see www.inquira.com,
which integrates automated natural language search (i.e., finding
specific answers to natural language questions from within a text
corpus) with question/answer scripting capabilities.

I believe Lucene would be an excellent foundation for a system like
this, but it would need to be extended with a natural language query
parser / search-query generator and, if desired, some form of scripting
knowledge base.  Somebody may have gone down this path, but I'm not
aware of it.

Chuck

   -Original Message-
   From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
   Sent: Tuesday, December 28, 2004 7:52 PM
   To: lucene-user@jakarta.apache.org
   Subject: Asking Questions in a Search
   
   Hi
   
   Is it possible to do something like this with lucene:
   http://www.verity.com/products/response/index.html
   
   Thanks
   
   
  



RE: Poor Lucene Ranking for Short Text

2004-12-24 Thread Chuck Williams
I think you are confusing lengthNorm and the overall normalization of the 
score.  For overall normalization (prior to a final forced normalization in 
Hits), Lucene uses the formula you cite, except that it never sums 
tf_d*idf_t, using instead tf_q*idf_t again, because the former is 
computationally intractable: changing even a single document changes the idf 
values, which means either that all document norms would have to be 
recomputed or that the sum over the document would need to happen at query 
time.  The former is unacceptable at indexing time with large indices and 
the latter is unacceptable at query time with large documents.

lengthNorm is by default 1/sqrt(number_terms_in_document).  It is not 1.0f by 
default because 1.0f is in general not a good value; e.g., a single occurrence 
of a term in a 1meg document is not as significant as a single occurrence of 
the same term in a 1k document.  However, I find the default value to need 
additional damping because it affects the score too much, especially for small 
documents.  So, I use something like
   3.0f/log10(1000 + number_terms_in_document)
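As plain arithmetic, that damping can be compared with the default like this (constants as above; the point is how little the damped norm varies with document length):

```java
public class LengthNorms {
    // Lucene's default lengthNorm: 1/sqrt(number of terms in field).
    public static float defaultNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // The damped variant suggested above: 3.0f/log10(1000 + numTerms).
    public static float dampedNorm(int numTerms) {
        return (float) (3.0 / Math.log10(1000.0 + numTerms));
    }
}
```

Going from a 10-term to a 1000-term field, the default norm shrinks by a factor of 10, while the damped norm shrinks by only about 10%, so length stops dominating the score for short documents.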

Chuck

   -Original Message-
   From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
   Sent: Friday, December 24, 2004 8:24 AM
   To: 'Lucene Users List'
   Subject: AW: Poor Lucene Ranking for Short Text
   
   Hi Kevin,
   
   Seems like you have some knowledge about the lengthNorm value in
   Lucene.  Comparing it to the formula in Modern Information
   Retrieval, does it sum up the denominator
   sqrt(sum((tf_d*idf_t)^2)) * sqrt(sum((tf_q*idf_t)^2))?
   
   Just a quick note is ok.
   
   Besides that, could you invite me to Rojo?  Their beta status seems
   to be taking quite long.
   
   Thanks
   Michael
   
    | -Original Message-
    | From: [EMAIL PROTECTED]
    | [mailto:[EMAIL PROTECTED] On Behalf Of Kevin A. Burton
    | Sent: Wednesday, 27 October 2004 22:48
    | To: Lucene Users List
    | Subject: Re: Poor Lucene Ranking for Short Text
   |
   | Daniel Naber wrote:
   |
   |  (Kevin complains about shorter documents ranked higher)
   | 
   | This is something that can easily be fixed. Just use a Similarity
   | implementation that extends DefaultSimilarity and that overwrites
   | lengthNorm: just return 1.0f there. You need to use that
   | Similarity for
   | indexing and searching, i.e. it requires reindexing.
   | 
   | 
   | What happens when I do this with an existing index? I don't
   | want to have to rewrite this index as it will take FOREVER
   |
   | If the current behavior is all that happens this is fine...
   | this way I can just get this behavior for new documents that
   | are added.
   |
   | Also... why isn't this the default?
   |
   | Kevin
   |
   | --
   |
   | Use Rojo (RSS/Atom aggregator).  Visit http://rojo.com. Ask
   | me for an invite!  Also see irc.freenode.net #rojo if you
   | want to chat.
   |
   | Rojo is Hiring! - http://www.rojonetworks.com/JobsAtRojo.html
   |
   | If you're interested in RSS, Weblogs, Social Networking,
   | etc... then you should work for Rojo!  If you recommend
   | someone and we hire them you'll get a free iPod!
   |
   | Kevin A. Burton, Location - San Francisco, CA
   |AIM/YIM - sfburtonator,  Web - http://peerfear.org/
   | GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
   |
   |



RE: I though I understood, but obviously I missed something.

2004-12-24 Thread Chuck Williams
All of your Document.add's need to be doc.add's.  You are adding the
field to the document, not the class.
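For reference, a corrected sketch of the snippet (same Lucene 1.4-era Field constructor, with add() called on the doc instance rather than on the Document class):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class BuildDoc {
    // add() is an instance method: call it on the Document object you
    // constructed, not on the Document class.
    public static Document build(String content, String title, String date) {
        Document doc = new Document();
        doc.add(new Field("content", content, false, true, true));
        doc.add(new Field("title", title, true, true, true));
        doc.add(new Field("date", date, true, true, false));
        return doc;
    }
}
```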

Chuck

   -Original Message-
   From: Jim Lynch [mailto:[EMAIL PROTECTED]
   Sent: Friday, December 24, 2004 8:30 AM
   To: Lucene Users List
   Subject: I though I understood, but obviously I missed something.
   
   A snippet from my program:
   
    Document doc = new Document();
    Field fContent =
        new Field("content", content.toString(), false, true, true);
    Field fTitle = new Field("title", title, true, true, true);
    Field fDate = new Field("date", date, true, true, false);
    Document.add(fContent);
    Document.add(fTitle);
    Document.add(fDate);
   
    Generates this (and others like it) error:
    
    method add(org.apache.lucene.document.Field) cannot be referenced
    from a static context
        [javac] Document.add(fContent);
   
   Where did I go wrong?
   
   
   Thanks,
   Jim.
   
   
  



RE: Relevance percentage

2004-12-23 Thread Chuck Williams
Gururaja,

If you want to score based solely on coord(), then Paul's approach looks
best.  However, based on your earlier messages, it looks to me like you
want to score based on all factors (with coord boosted as Paul
suggested, or lengthNorm flattened as I suggested -- either will get the
order you want in the example you posted), but you want to print the
(unboosted) coord percentage along with each result in the result list.

If this is the case, since the number of results per page on the result
list is presumably small, I think you are best off replicating the
explain() mechanism.  I don't have the source code, but you can look at
IndexSearcher.explain(), which recreates the weight with Query.weight(),
then calls what in this case will be
BooleanQuery.BooleanWeight.explain(), which has the code to recompute
coord on a result (specifically it computes overlap and maxoverlap and
then calls Similarity.coord()).  You could cut and paste this code to
just compute coord for your top-level BooleanQuery's.
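A sketch of that recomputation which leans on per-clause explain() rather than cut-and-pasting BooleanWeight internals; it is simpler, at the cost of one explain() call per clause per displayed hit, which is fine for a page of results:

```java
import java.io.IOException;

import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.Searcher;

public class CoordFraction {
    // Recompute the unboosted coord fraction for one hit of a
    // top-level BooleanQuery: the share of clauses that matched.
    public static float coord(Searcher searcher, BooleanQuery query, int docId)
            throws IOException {
        BooleanClause[] clauses = query.getClauses();
        int overlap = 0;
        for (int i = 0; i < clauses.length; i++) {
            Explanation e = searcher.explain(clauses[i].query, docId);
            if (e.getValue() > 0.0f) {
                overlap++;
            }
        }
        return clauses.length == 0 ? 0.0f : overlap / (float) clauses.length;
    }
}
```

Multiplying by 100 gives the percentage to print next to each result, independent of whatever boosts shape the ranking itself.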

Sorry I don't have source code to do this, but the approach should work.
Good luck,

Chuck

   -Original Message-
   From: Paul Elschot [mailto:[EMAIL PROTECTED]
   Sent: Wednesday, December 22, 2004 11:59 PM
   To: lucene-user@jakarta.apache.org
   Subject: Re: Relevance percentage
   
   On Thursday 23 December 2004 08:13, Gururaja H wrote:
Hi Chuck Williams,
   
Thanks much for the reply.
   
If your queries are all BooleanQuery's of
TermQuery's, then this is very simple. Iterate down the list of
 BooleanClause's and count the number whose score is > 0, then
 divide this by the total number of clauses. Take a look at
BooleanQuery.BooleanWeight.explain() as it does this (along with
generating the rest of the explanation). If you support the full
   Lucene
query language, then you need to look at all the query types and
   decide
what exactly you want to compute (as coord is not always well-
   defined).
   
We are supporting full Lucene query language.
   
My request is, assuming queries are all BooleanQuery please
post the implementation source code for the same.  ie to calculate
the
   coord() method input parameters overlap and maxOverlap.
   
   I don't have the code, but I can give an overview of possible
   steps:
   
   First inherit from BooleanScorer to implement a score() method that
   returns only the coord() value (preferably a precomputed one).
   Then inherit from BooleanQuery.BooleanWeight to return the above
   Scorer.
   Then inherit from BooleanQuery to use the above Weight in
createWeight().
   Then inherit from QueryParser to use the above Query in
   getBooleanQuery().
   Finally use such a query in a search: the document scores will be
   the coord() values.
   
   Regards,
   Paul Elschot.
   
   
  



RE: Lucene index files from two different applications.

2004-12-21 Thread Chuck Williams
Depending on what you are doing, there are some problems with
MultiSearcher.   See
http://issues.apache.org/bugzilla/show_bug.cgi?id=31841 for a
description of the issues and possible patch(es) to fix.
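For context, the basic MultiSearcher usage that these issues affect is sketched below (Lucene 1.4-era API; the index paths and field name are hypothetical):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.search.Searcher;

public class TwoIndexSearch {
    // Search two separately written indexes as one; results from both
    // are merged into a single ranked Hits list.
    public static Hits search(String queryText) throws Exception {
        Searchable[] parts = {
            new IndexSearcher("/path/to/indexA"),
            new IndexSearcher("/path/to/indexB")
        };
        Searcher searcher = new MultiSearcher(parts);
        return searcher.search(
            QueryParser.parse(queryText, "contents", new StandardAnalyzer()));
    }
}
```

The bug report above concerns how docFreq(), and therefore idf, is computed across the sub-indexes, so rankings can differ slightly from searching one merged index.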

Chuck

   -Original Message-
   From: Erik Hatcher [mailto:[EMAIL PROTECTED]
   Sent: Tuesday, December 21, 2004 3:09 AM
   To: Lucene Users List
   Subject: Re: Lucene index files from two different applications.
   
   
   On Dec 21, 2004, at 5:51 AM, Gururaja H wrote:
1.  Can two applications write index files, in the same directory,
at
the same time ?
   
   If you mean to the same Lucene index, the answer is no.  Only a
single
   IndexWriter instance may be writing to an index at one time.
   
2.  If two applications cannot write index files, in the same
directory, at the same time.
 How should we resolve this ?  Would appriciate any solutions
to
this...
   
   You may consider writing a queuing system so that two applications
   queue up a document to index, and a single indexer application reads
   from the queue.  Or the applications could wait until the index is
   available for writing.  Or...
   
3.  My thought is to write the index files in two different
directories and read both the indexes
(as though it forms a single index, search results should consider
the
documents in both the indexes) from the WebApplication.  How to go
about implementing this, using Lucene API ?  Need inputs on which
of
the Lucene API's to use ?
   
   Lucene can easily search from multiple indexes using MultiSearcher.
   This merges the results together as you'd expect.
   
   Erik
   
   
  



RE: Relevance percentage

2004-12-20 Thread Chuck Williams
The coord() value is not saved anywhere so you would need to recompute
it.  You could either call explain() and parse the result string, or
better, look at explain() and implement what it does more efficiently
just for coord().  If your queries are all BooleanQuery's of
TermQuery's, then this is very simple.  Iterate down the list of
BooleanClause's and count the number whose score is > 0, then divide
this by the total number of clauses.  Take a look at
BooleanQuery.BooleanWeight.explain() as it does this (along with
generating the rest of the explanation).  If you support the full Lucene
query language, then you need to look at all the query types and decide
what exactly you want to compute (as coord is not always well-defined).
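
The clause-counting recipe above can be sketched without any Lucene classes. This is an illustrative stand-alone computation of DefaultSimilarity's coord = overlap / maxOverlap (the class name and the score-array representation are my own, not Lucene API):

```java
public class CoordSketch {
    // coord = overlap / maxOverlap, mirroring DefaultSimilarity.coord().
    // clauseScores[i] is the score contribution of the i-th boolean clause;
    // a clause counts as matched iff its score is > 0.
    public static float coord(float[] clauseScores) {
        int overlap = 0;
        for (int i = 0; i < clauseScores.length; i++) {
            if (clauseScores[i] > 0f) overlap++;
        }
        return overlap / (float) clauseScores.length;
    }

    public static void main(String[] args) {
        // 4 of 5 query terms matched -> coord = 0.8
        System.out.println(coord(new float[] {0.2f, 0f, 0.1f, 0.3f, 0.05f}));
    }
}
```

Computed this way per hit, the coord value is exactly the "percentage of query terms matched" asked about below.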

I'm on the West Coast of the U.S. so evidently on a very different time
zone from you -- will look at your other message next.

Chuck

   -Original Message-
   From: Gururaja H [mailto:[EMAIL PROTECTED]
   Sent: Monday, December 20, 2004 6:10 AM
   To: Lucene Users List; Mike Snare
   Subject: Re: Relevance percentage
   
   Hi,
   
   But, How to calculate the coord() fraction ?  I know by default,
   in DefaultSimilarity the coord() fraction is defined as below:
   
    /** Implemented as <code>overlap / maxOverlap</code>. */
    public float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }
   How to get the overlap and maxOverlap value in each of the matched
   document(s) ?
   
   Thanks,
   Gururaja
   
   Mike Snare [EMAIL PROTECTED] wrote:
   I'm still new to Lucene, but wouldn't that be the coord()? My
   understanding is that the coord() is the fraction of the boolean
query
   that matched a given document.
   
   Again, I'm new, so somebody else will have to confirm or deny...
   
   -Mike
   
   
   On Mon, 20 Dec 2004 00:33:21 -0800 (PST), Gururaja H
   wrote:
How to find out the percentages of matched terms in the
document(s)
   using Lucene ?
Here is an example, of what i am trying to do:
The search query has 5 terms(ibm, risc, tape, dirve, manual) and
there
   are 4 matching
documents with the following attributes:
Doc#1: contains terms(ibm,drive)
Doc#2: contains terms(ibm,risc, tape, drive)
Doc#3: contains terms(ibm,risc, tape,drive)
Doc#4: contains terms(ibm, risc, tape, drive, manual).
The percentages displayed would be 100%(Doc#4), 80%(doc#2),
80%(doc#3)
   and 40%
(doc#1).
   
Any help on how to go about doing this ?
   
Thanks,
Gururaja
   
   



RE: Relevance and ranking ...

2004-12-20 Thread Chuck Williams
I believe your sole problem is that you need to tone down your
lengthNorm.  Because doc4 is 10 times longer than doc2, its lengthNorm
is less than 1/3 of that of doc2 (1/sqrt(10) to be precise).  This is a
larger effect than the higher coord factor (1/.8) and the extra matching
term in doc4.

In your original description, it sounds like you want coord() to
dominate lengthNorm(), with lengthNorm() just being used as a
tie-breaker among queries with the same coord().

To achieve this, you need to reduce the impact of the lengthNorm()
differences, by changing the sqrt() function in the computation of
lengthNorm to something much flatter.  E.g., you might use:

  public float lengthNorm(String fieldName, int numTerms) {
return (float)(1.0 / Math.log10(1000+numTerms));
  }

I'm not sure whether that specific formula will work, but you can find
one that will by adjusting the base of the logarithm and the additive
constant (1000 in the example).
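
To see how much this flattens the length penalty, compare the default 1/sqrt(numTerms) against the inverse-log variant for the 30-term and 300-term documents in this thread (stand-alone sketch; Math.log10 needs Java 5 -- on 1.4 substitute Math.log(x)/Math.log(10)):

```java
public class LengthNormSketch {
    // DefaultSimilarity: lengthNorm = 1 / sqrt(numTerms)
    public static double defaultNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Flattened variant from this post: 1 / log10(1000 + numTerms)
    public static double flatNorm(int numTerms) {
        return 1.0 / Math.log10(1000 + numTerms);
    }

    public static void main(String[] args) {
        // Default: a 30-term doc is favored over a 300-term doc by
        // sqrt(10), roughly 3.16x -- enough to swamp coord().
        System.out.println(defaultNorm(30) / defaultNorm(300));
        // Flattened: the same pair differs by only about 1.03x, so a
        // higher coord() can win again.
        System.out.println(flatNorm(30) / flatNorm(300));
    }
}
```

Raising the additive constant flattens the curve further; lowering it restores more of the length penalty.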

Some general things:
  1.  You need to reindex when you change the Similarity (it is used for
indexing and searching -- e.g., the lengthNorm's are computed at index
time).
  2.  Be careful not to overtune your scoring for just one example.  Try
many examples.  You won't be able to get it perfect -- the idea is to
get close to your subjective judgments as frequently as possible.
  3.  The idea here is to find a value of lengthNorm() that doesn't
override coord, but still provides the tie-breaking you are looking for
(doc2 ahead of doc3).

Chuck

   -Original Message-
   From: Gururaja H [mailto:[EMAIL PROTECTED]
   Sent: Sunday, December 19, 2004 10:10 PM
   To: Lucene Users List
   Subject: RE: Relevance and ranking ...
   
   Chuck Williams,
   
   Thanks for the reply. Source code and Output are below.
   
   Please give me your inputs.
   
   Default document order i am getting is: Doc#2, Doc#4, Doc#3, Doc#1.
   Document order needed is: Doc#4, Doc#2, Doc#3, Doc#1.
   
   Let me know, if you need more information.
   
   NOTE: Using Luene Query object not BooleanQuery.
   
   Here is the source code:
   
    Searcher searcher = new IndexSearcher(index);
    Analyzer analyzer = new StandardAnalyzer();
    BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
    System.out.print("Query: ");
    String line = in.readLine();
    Query query = QueryParser.parse(line, "contents", analyzer);
    System.out.println("Searching for: " + query.toString("contents"));
    Hits hits = searcher.search(query);
    System.out.println(hits.length() + " total matching documents");
    for (int i = start; i < hits.length(); i++) {
      Document doc = hits.doc(i);
      System.out.print("Score is: " + hits.score(i));
      // Use whatever your fields are here:
      System.out.print("  title: ");
      System.out.print(doc.get("title"));
      System.out.print(" description: ");
      System.out.println(doc.get("description"));
      // End of fields
      System.out.println(searcher.explain(query, hits.id(i)));
      // System.out.println("Score of the document is: " + hits.score(i));
      String path = doc.get("path");
      if (path != null) {
        System.out.println(i + ". " + path);
        System.out.println("--");
      }
    }
 ---
   
   
   Here is the output from the program:
   
   Query: ibm risc tape drive manual
   
   Searching for: ibm risc tape drive manual
   
   4 total matching documents
   
    Score is: 0.16266039 title:null description:null

    0.16266039 = product of:
      0.20332548 = sum of:
        0.03826245 = weight(contents:ibm in 1), product of:
          0.31521872 = queryWeight(contents:ibm), product of:
            0.7768564 = idf(docFreq=4)
            0.40576187 = queryNorm
          0.121383816 = fieldWeight(contents:ibm in 1), product of:
            1.0 = tf(termFreq(contents:ibm)=1)
            0.7768564 = idf(docFreq=4)
            0.15625 = fieldNorm(field=contents, doc=1)
        0.06340029 = weight(contents:risc in 1), product of:
          0.40576187 = queryWeight(contents:risc), product of:
            1.0 = idf(docFreq=3)
            0.40576187 = queryNorm
          0.15625 = fieldWeight(contents:risc in 1), product of:
            1.0 = tf(termFreq(contents:risc)=1)
            1.0 = idf(docFreq=3)
            0.15625 = fieldNorm(field=contents, doc=1)
        0.06340029 = weight(contents:tape in 1), product of:
          0.40576187 = queryWeight(contents:tape), product of:
            1.0 = idf(docFreq=3)
            0.40576187 = queryNorm
          0.15625 = fieldWeight(contents:tape in 1), product of:
            1.0 = tf(termFreq(contents:tape)=1)
            1.0 = idf(docFreq=3)
            0.15625 = fieldNorm(field=contents, doc=1)
        0.03826245 = weight(contents:drive in 1), product of:
          0.31521872 = queryWeight(contents:drive), product of:
            0.7768564 = idf(docFreq=4)
            0.40576187 = queryNorm
          0.121383816 = fieldWeight(contents:drive in 1), product of:
            1.0 = tf(termFreq(contents:drive)=1)
            0.7768564

RE: determination of matching hits

2004-12-20 Thread Chuck Williams
This is not the official recommendation, but I'd suggest you at least
consider:  http://issues.apache.org/bugzilla/show_bug.cgi?id=32674

If you're not using Java 1.5 and you decide you want to use it, you'd
need to take out those dependencies.  If you improve it, please share.

Chuck

   -Original Message-
   From: Christiaan Fluit [mailto:[EMAIL PROTECTED]
   Sent: Monday, December 20, 2004 2:51 PM
   To: Lucene Users List
   Subject: Re: determination of matching hits
   
   ok, I feel a bit stupid now ;) Turns out this issue has been
discussed a
   while ago on both mailing lists and I even participated in one of
   them... shame on me.
   
   The problem is indeed in how MFQP parses my query: the query A -B
   becomes:
   
   (text:A -text:B) (title:A -title:B) (path:A -path:B) (summary:A
   -summary:B) (agent:A -agent:B)
   
   whereas I intuitively expected it to be evaluated as A in any field
and
   not B in any field. When I use a normal QueryParser and let it use
a
   single field only, everything works as expected.
   
   Browsing the lists archives I see that there were some efforts from
   different people in solving this issue, but I'm a bit confused about
the
   final outcome. Was this solved in the MFQP in 1.4.3? If not, what
   alternative implementation of MFPQ can I currently use best?
   
   
   Kind regards,
   
   Chris
   --
   
   Erik Hatcher wrote:
Christian,
   
Please simplify your situation.  Use a plain TermQuery for B and
see
what is returned.  Then use a simple BooleanQuery for A -B.  I
   suspect
MultiFieldQueryParser is the culprit.  What does the toString of
the
generated Query return?  MFQP is known to be trouble, and an
overhaul
   to
it has been contributed recently.
   
  



RE: Relevance and ranking ...

2004-12-18 Thread Chuck Williams
The coord is the fraction of clauses matched in a BooleanQuery, so with
your example of a 5-word BooleanQuery, the coord factors should be .4,
.8, .8, 1.0 respectively for doc1, doc2, doc3 and doc4.

One big issue you've got here is lengthNorm.  Doc2 is 1/10 the size of
doc4, so its lengthNorm is over 3x larger (sqrt(10)).  This more than
makes up for the difference in coord.  In your original post you
indicated a desire for a linear lengthNorm, which would actually make
this problem much worse.  You probably need to tone down the lengthNorm
instead (I turn mine off entirely, at least so far, by fixing it at 1.0;
this is not good in general, but got me past similar problems until I
can find a good formula).  You might try an inverse-log lengthNorm with
a high base (like the formula for idf I posted earlier).

The other thing that can bite you is the tf and idf computations.  E.g.,
if manual is a more common term than the others, this could cause the
tf*idf scores on doc2 to more than compensate for the difference in
coord, even if you set lengthNorm to be 1.0.

What is happening will be apparent from the explanations.  If you print
these out and post them, I'd be happy to suggest specific formulas.
Just use code like this:

  IndexSearcher searcher = new IndexSearcher(directory);
  System.out.println(query);
  Hits hits = searcher.search(query);
  for (int i = 0; i < hits.length(); i++) {
      Document doc = hits.doc(i);
      System.out.print(hits.score(i));
      // Use whatever your fields are here:
      System.out.print("  title: ");
      System.out.print(doc.get("title"));
      System.out.print(" description: ");
      System.out.println(doc.get("description"));
      // End of fields
      System.out.println(searcher.explain(query, hits.id(i)));
      System.out.println("--");
  }

Chuck

   -Original Message-
   From: Gururaja H [mailto:[EMAIL PROTECTED]
   Sent: Saturday, December 18, 2004 4:56 AM
   To: Lucene Users List
   Subject: Re: Relevance and ranking ...
   
   Hi Erik,
   
   Created my own subclass of Similarity.  When i printed the values
for
   coord() factor
   i am getting the same for all the 4 documents.  So the value is NOT
   getting boosted.
   Want to do this. as i want the document that has
   e.g., all three terms in a three word query over those that contain
just
   two
   of the words.
   
   Please let me know how I go about doing this ?  Please explain the
   coordination factor ?
   
   The default order of document that i get for my example given in
this
   thread is as follows:
   Doc#2
   Doc#4
   Doc#3
   Doc#1
   
   Any inputs would be helpful.  Thanks,
   
   Gururaja
   
   Erik Hatcher [EMAIL PROTECTED] wrote:
   
   On Dec 17, 2004, at 6:09 AM, Gururaja H wrote:
Thanks for the reply. Is there any sample code which tells me how
to
change these
coord() factor, overlapping, length normalization etc. ??
If there are any please provide me.
   
   Have a look at Lucene's DefaultSimilarity code itself. Use that as a
   starting point - in fact you should subclass it and only override
the
   one or two methods you want to tweak.
   
   There are probably some other examples in Lucene's test cases, or
that
   have been posted to the list but I don't have handy pointers to
them.
   
   Erik
   
   
   
Thanks,
Gururaja
   
   
Erik Hatcher wrote:
The coord() factor of Similarity is what controls a multiplier
factor
for overlapping query terms in a document. The DefaultSimilarity
already contains factors that allow documents with overlapping
terms
   to
get boosted. Is this not working for you? You may also need to
adjust
length normalization factors. Check the javadocs on Similarity for
details on implementing your own formulas. Also become familiar
with
IndexSearcher.explain() and the Explanation so that you can see
how
adjusting things affects the details.
   
Erik
   
On Dec 17, 2004, at 3:42 AM, Gururaja H wrote:
   
Hi,
   
How to implement the following ? Please provide inputs 
   
   
For example, if the search query has 5 terms (ibm, risc, tape,
drive,
manual) and there are 4 matching documents with the following
attributes, then the order should be as described below.
   
Doc#1: contains terms (ibm, drive) and has a total of 100 terms
in
   the
document.
   
Doc#2: contains terms (ibm, risc, tape, drive) and has a total of
30
terms in the document.
   
Doc#3: contains terms (ibm, risc, tape, drive) and has a total of
100
terms in the document.
   
Doc#4: contains terms (ibm, risc, tape, drive, manual) and has a
   total
of 300 terms in the document
   
The search results should include all three documents since each
has
one or more of the search terms, however, the order should be
   returned

RE: Relevance and ranking ...

2004-12-17 Thread Chuck Williams
Another issue will likely be the tf() and idf() computations.  I have a
similar desired relevance ranking and was not getting what I wanted due
to the idf() term dominating the score.  Lucene squares the contribution
of this term, which is not considered best practice in IR.  To address
these issues, I increased the base of the log for both tf() and idf()
(tones them down) and took a final square root on idf().  FYI, here are
the definitions I'm using for these methods -- similar definitions
should give you the ordering you want.  You might want to adjust
lengthNorm if you really want it to be linear (square root by default).
You should not have to touch coord().

public float tf(float freq) {
return 1.0f + (float)Math.log10(freq);
}

public float idf(int docFreq, int numDocs) {
  return (float)Math.sqrt(1.0 +
  Math.log10(numDocs/(double)(docFreq+1)));
}
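
Plugging sample numbers into these two methods shows how toned down they are (a stand-alone restatement of the formulas above; the sample docFreq/numDocs values are made up):

```java
public class TonedSimilaritySketch {
    // tf: 1 + log10(freq), much flatter than the default sqrt(freq)
    public static float tf(float freq) {
        return 1.0f + (float) Math.log10(freq);
    }

    // idf: logged and then square-rooted, so it is no longer
    // effectively squared when queryWeight * fieldWeight multiply out
    public static float idf(int docFreq, int numDocs) {
        return (float) Math.sqrt(1.0
                + Math.log10(numDocs / (double) (docFreq + 1)));
    }

    public static void main(String[] args) {
        System.out.println(tf(1));          // 1.0
        System.out.println(tf(100));        // 3.0 -- 100 occurrences only triple it
        System.out.println(idf(9, 1000));   // rare term: sqrt(1 + 2), about 1.73
        System.out.println(idf(499, 1000)); // common term: about 1.14
    }
}
```

Under the default Similarity the same rare/common idf gap would be several times larger, which is exactly what lets one unusual query term dominate coord().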


Chuck

   -Original Message-
   From: Erik Hatcher [mailto:[EMAIL PROTECTED]
   Sent: Friday, December 17, 2004 4:06 AM
   To: Lucene Users List
   Subject: Re: Relevance and ranking ...
   
   
   On Dec 17, 2004, at 6:09 AM, Gururaja H wrote:
Thanks for the reply.  Is there any sample code which tells me how
to
change these
coord() factor,  overlapping, length normalization etc. ??
If there are any please provide me.
   
   Have a look at Lucene's DefaultSimilarity code itself.  Use that as
a
   starting point - in fact you should subclass it and only override
the
   one or two methods you want to tweak.
   
   There are probably some other examples in Lucene's test cases, or
that
   have been posted to the list but I don't have handy pointers to
them.
   
   Erik
   
   
   
Thanks,
Gururaja
   
   
Erik Hatcher [EMAIL PROTECTED] wrote:
The coord() factor of Similarity is what controls a multiplier
factor
for overlapping query terms in a document. The DefaultSimilarity
already contains factors that allow documents with overlapping
terms
   to
get boosted. Is this not working for you? You may also need to
adjust
length normalization factors. Check the javadocs on Similarity for
details on implementing your own formulas. Also become familiar
with
IndexSearcher.explain() and the Explanation so that you can see
how
adjusting things affects the details.
   
Erik
   
On Dec 17, 2004, at 3:42 AM, Gururaja H wrote:
   
Hi,
   
How to implement the following ? Please provide inputs 
   
   
For example, if the search query has 5 terms (ibm, risc, tape,
drive,
manual) and there are 4 matching documents with the following
attributes, then the order should be as described below.
   
Doc#1: contains terms (ibm, drive) and has a total of 100 terms
in
   the
document.
   
Doc#2: contains terms (ibm, risc, tape, drive) and has a total of
30
terms in the document.
   
Doc#3: contains terms (ibm, risc, tape, drive) and has a total of
100
terms in the document.
   
Doc#4: contains terms (ibm, risc, tape, drive, manual) and has a
   total
of 300 terms in the document
   
The search results should include all three documents since each
has
one or more of the search terms, however, the order should be
   returned
as:
   
Doc#4
   
Doc#2
   
Doc#3
   
Doc#1
   
Doc#4 should be first, since of the 5 search terms, it contains
all 5.
   
Doc#2 should be second, since it has 4 of the 5 search terms and
of
the number of terms in the document, its ratio is higher than
Doc#3
(4/30). Doc#3 has 4 of the 5 terms, but its ratio is 4/100.
   
Doc#1 is last since it only has 2 of the 5 terms.
   
   

   
Thanks,
Gururaja
   
   



RE: Indexing with Lucene 1.4.3

2004-12-16 Thread Chuck Williams
That looks right to me, assuming you have done an optimize.  All of your
index segments are merged into the one .cfs file (which is large,
right?).  Try searching -- it should work.

Chuck

   -Original Message-
   From: Hetan Shah [mailto:[EMAIL PROTECTED]
   Sent: Thursday, December 16, 2004 11:00 AM
   To: Lucene Users List
   Subject: Indexing with Lucene 1.4.3
   
   Hello,
   
   I have been trying to index around 6000 documents using IndexHTML
from
   1.4.3 and at the end of indexing in my index directory I only have 3
   files.
   segments
   deletable and
   _5en.cfs
   
   Can someone tell me what is going on and where are the actual index
   files? How can I resolve this issue?
   Thanks.
   -H
   
   
  



RE: NUMERIC RANGE BOOLEAN

2004-12-16 Thread Chuck Williams
Karthik,

RangeQuery expands into a BooleanQuery containing all of the terms in
the index that fall within the range.  By default, BooleanQuery's can
have at most 1,024 terms.  So, if your index has more than 1,024
different prices that fall within your range then you will hit this
exception.  What matters is distinct prices, not multiple items.  E.g.,
it's ok to have 10,000 items at $5 -- that's just one price.  But more
than 1,024 distinct prices is a problem.

You can fix this at least a couple different ways.
  1.  Increase the maximum number of clauses allowed in a BooleanQuery
(see BooleanQuery.maxClauseCount).  Note that this is done at a cost of
performance.
  2.  Restructure your indexed prices and range query to reduce the
number of clauses.  E.g., index dollars and cents as two different
fields.  Then, for a range like $1.33 to $5.27, construct an or of 3
queries:
a.  $1 and [33 to 99 cents]
b.  [$2 to $5]
c.  $5 and [0 to 27 cents]
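
This decomposition is mechanical; note that the middle whole-dollar range should be [$2 to $4] (the $5 in part b is a typo that Chuck corrects in his errata below, since $5.00-$5.27 is already covered by part c). A hypothetical helper, not Lucene API, assuming the range spans at least two whole-dollar boundaries as $1.33-$5.27 does:

```java
public class PriceRangeSketch {
    // Split an inclusive price range (in cents) into the three pieces
    // above: low-dollar partial cents range, whole-dollar middle range,
    // high-dollar partial cents range.  Returned as
    // {loDollar, loCentsFrom, loCentsTo, midDollarFrom, midDollarTo,
    //  hiDollar, hiCentsFrom, hiCentsTo}.
    public static int[] split(int fromCents, int toCents) {
        int loDollar = fromCents / 100, loCents = fromCents % 100;
        int hiDollar = toCents / 100, hiCents = toCents % 100;
        return new int[] {
            loDollar, loCents, 99,      // a. $1 and [33 to 99 cents]
            loDollar + 1, hiDollar - 1, // b. [$2 to $4]
            hiDollar, 0, hiCents        // c. $5 and [0 to 27 cents]
        };
    }

    public static void main(String[] args) {
        // $1.33 to $5.27
        System.out.println(java.util.Arrays.toString(split(133, 527)));
        // prints [1, 33, 99, 2, 4, 5, 0, 27]
    }
}
```

Each subrange then expands to at most 100 cent terms or a handful of dollar terms, comfortably under the 1,024-clause limit.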

I don't know about RangeFilter, but look at QueryFilter.  You can use it
with a RangeQuery to implement a range filter.  However, I think you'll
hit the same issue, so Erik may be referring to a new mechanism that is
not in 1.4.3.

Chuck

   -Original Message-
   From: Karthik N S [mailto:[EMAIL PROTECTED]
   Sent: Thursday, December 16, 2004 9:38 PM
   To: Lucene Users List
   Subject: RE: NUMERIC RANGE BOOLEAN
   
   Hi Erik
   
   Apologies..
   
   
  Sometimes I find it hard to understand the Answer you reply ...
   
   
   
  1) I looked at the Wiki and similarly padded '0' [ Total
Length =
   8 ]
   at the time of indexing
   
   so before Indexprocess the values will be   $ 10.25 , $ 0.50 ,$
   15.50.
   
   After padding and indexing finally [ Used Luke to monitor ] the
   values
   were 0010.25 ,.25,0015.50
   
   
 2) I did not find the RangeFilter API in Lucene1.4.3 [is it
recently
   added
   if so How Do I use the same some code snippets
please ]
   
   
   
   
   with regards
   Karthik
   
   
   
   -Original Message-
   From: Erik Hatcher [mailto:[EMAIL PROTECTED]
   Sent: Thursday, December 16, 2004 6:55 PM
   To: Lucene Users List
   Subject: Re: NUMERIC RANGE BOOLEAN
   
   
   On Dec 16, 2004, at 7:17 AM, Karthik N S wrote:
We have to get the All the Hits int the Range ,
   
   So 0.99 cents IS ALWAYS 0.99 cents, on which we do the price
Comparison from the consumer point of view.
   
   
I hope  I have answered u'r Question
   
   
   No, in fact, you have not.  If you want to continue to receive my
help
   here, you need to provide *details*.  You pose often ambiguous and
hard
   to decipher questions.  Please help us help you by answering the
   questions we ask precisely.  What are the values (exact string
values)
   in that field?  Please also read the wiki page on indexing numeric
   values.
   
   Look at using the new RangeFilter rather than a RangeQuery due to
the
   noted issues with doing a RangeQuery.
   
   Erik
   
   
   
   
With regards
Karthik
   
-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Thursday, December 16, 2004 5:24 PM
To: Lucene Users List
Subject: Re: NUMERIC RANGE BOOLEAN
   
   
On Dec 16, 2004, at 5:03 AM, Morus Walter wrote:
Erik Hatcher writes:
   
TooManyClauses exception occurs when a query such as a
RangeQuery
expands to more than 1024 terms.  I don't see how this could be
the
case in the query you provided - are you certain that is the
query
that
generated the error?
   
Why not: the terms might be 0003 0003.1 0003.11 ...
   
So the question is, how do his terms look like...
   
Ah, good point! So, Karthik - what are the values of those
terms?
   
Pragmatically, do you really need to do a range involving the
cents of
a price?
   
  Erik
   
   
   
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail:
[EMAIL PROTECTED]
   
   
   
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail:
[EMAIL PROTECTED]
   
   
  



RE: NUMERIC RANGE BOOLEAN

2004-12-16 Thread Chuck Williams
Errata:
  b.  [$2 to $4]

Chuck

   -Original Message-
   From: Chuck Williams [mailto:[EMAIL PROTECTED]
   Sent: Thursday, December 16, 2004 9:58 PM
   To: Lucene Users List
   Subject: RE: NUMERIC RANGE BOOLEAN
   
   Karthik,
   
   RangeQuery expands into a BooleanQuery containing all of the terms
in
   the index that fall within the range.  By default, BooleanQuery's
can
   have at most 1,024 terms.  So, if your index has more than 1,024
   different prices that fall within your range then you will hit this
   exception.  What matters is distinct prices, not multiple items.
E.g.,
   it's ok to have 10,000 items at $5 -- that's just one price.  But
more
   than 1,024 distinct prices is a problem.
   
   You can fix this at least a couple different ways.
 1.  Increase the maximum number of clauses allowed in a
BooleanQuery
   (see BooleanQuery.maxClauseCount).  Note that this is done at a cost
of
   performance.
 2.  Restructure your indexed prices and range query to reduce the
   number of clauses.  E.g., index dollars and cents as two different
   fields.  Then, for a range like $1.33 to $5.27, construct an or of 3
   queries:
   a.  $1 and [33 to 99 cents]
   b.  [$2 to $5]
   c.  $5 and [0 to 27 cents]
   
   I don't know about RangeFilter, but look at QueryFilter.  You can
use it
   with a RangeQuery to implement a range filter.  However, I think
you'll
   hit the same issue, so Erik may be referring to a new mechanism that
is
   not in 1.4.3.
   
   Chuck
   
  -Original Message-
  From: Karthik N S [mailto:[EMAIL PROTECTED]
  Sent: Thursday, December 16, 2004 9:38 PM
  To: Lucene Users List
  Subject: RE: NUMERIC RANGE BOOLEAN
 
  Hi Erik
 
  Apologies..
 
 
Sometimes I find it hard to understand the Answer you reply

 
 
 
 1) I looked at the Wiki and similarly padded '0' [ Total
   Length =
  8 ]
  at the time of indexing
 
  so before Indexprocess the values will be   $ 10.25 , $ 0.50
,$
  15.50.
 
  After padding and indexing finally [ Used Luke to monitor ]
the
  values
  were 0010.25 ,.25,0015.50
 
 
2) I did not find the RangeFilter API in Lucene1.4.3 [is it
   recently
  added
  if so How Do I use the same some code snippets
   please ]
 
 
 
 
  with regards
  Karthik
 
 
 
  -Original Message-
  From: Erik Hatcher [mailto:[EMAIL PROTECTED]
  Sent: Thursday, December 16, 2004 6:55 PM
  To: Lucene Users List
  Subject: Re: NUMERIC RANGE BOOLEAN
 
 
  On Dec 16, 2004, at 7:17 AM, Karthik N S wrote:
   We have to get the All the Hits int the Range ,
  
   So 0.99 cents IS ALWAYS 0.99 cents, on which we do the
   price
   Comparison from the consumer point of view.
  
  
   I hope  I have answered u'r Question
 
 
  No, in fact, you have not.  If you want to continue to receive
my
   help
  here, you need to provide *details*.  You pose often ambiguous
and
   hard
  to decipher questions.  Please help us help you by answering the
  questions we ask precisely.  What are the values (exact string
   values)
  in that field?  Please also read the wiki page on indexing
numeric
  values.
 
  Look at using the new RangeFilter rather than a RangeQuery due
to
   the
  noted issues with doing a RangeQuery.
 
  Erik
 
 
  
  
   With regards
   Karthik
  
   -Original Message-
   From: Erik Hatcher [mailto:[EMAIL PROTECTED]
   Sent: Thursday, December 16, 2004 5:24 PM
   To: Lucene Users List
   Subject: Re: NUMERIC RANGE BOOLEAN
  
  
   On Dec 16, 2004, at 5:03 AM, Morus Walter wrote:
   Erik Hatcher writes:
  
   TooManyClauses exception occurs when a query such as a
   RangeQuery
   expands to more than 1024 terms.  I don't see how this could
be
   the
   case in the query you provided - are you certain that is the
   query
   that
   generated the error?
  
   Why not: the terms might be 0003 0003.1 0003.11
...
  
   So the question is, how do his terms look like...
  
   Ah, good point! So, Karthik - what are the values of those
   terms?
  
   Pragmatically, do you really need to do a range involving the
   cents of
   a price?
  
 Erik
  
  
  
  

RE: A question about scoring function in Lucene

2004-12-15 Thread Chuck Williams
I'll try to address all the comments here.

The normalization I proposed a while back on lucene-dev is specified.
Its properties can be analyzed, so there is no reason to guess about
them.

Re. Hoss's example and analysis, yes, I believe it can be demonstrated
that the proposed normalization would make certain absolute statements
like x and y meaningful.  However, it is not a panacea -- there would be
some limitations in these statements.

To see what could be said meaningfully, it is necessary to recall a
couple detailed aspects of the proposal:
  1.  The normalization would not change the ranking order or the ratios
among scores in a single result set from what they are now.  Only two
things change:  the query normalization constant, and the ad hoc final
normalization in Hits is eliminated because the scores are intrinsically
between 0 and 1.  Another way to look at this is that the sole purpose
of the normalization is to set the score of the highest-scoring result.
Once this score is set, all the other scores are determined since the
ratios of their scores to that of the top-scoring result do not change
from today.  Put simply, Hoss's explanation is correct.
  2.  There are multiple ways to normalize and achieve property 1.  One
simple approach is to set the top score based on the boost-weighted
percentage of query terms it matches (assuming, for simplicity, the
query is an OR-type BooleanQuery).  So if all boosts are the same, the
top score is the percentage of query terms matched.  If there are
boosts, then these cause the terms to have a corresponding relative
importance in the determination of this percentage.
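
As a concrete sketch of option 2, the boost-weighted percentage for the top result reduces to a few lines. This is illustrative only -- it describes the proposal, not current Lucene behavior:

```java
public class TopScoreNormSketch {
    // Proposed normalization: the top-ranked document's score is the
    // boost-weighted fraction of query terms it matches.  All other
    // scores keep their current ratio to the top score.
    public static double topScore(double[] boosts, boolean[] matched) {
        double matchedBoost = 0.0, totalBoost = 0.0;
        for (int i = 0; i < boosts.length; i++) {
            totalBoost += boosts[i];
            if (matched[i]) matchedBoost += boosts[i];
        }
        return matchedBoost / totalBoost;
    }

    public static void main(String[] args) {
        // Unboosted two-term OR query whose best doc matches one term: 0.5
        System.out.println(topScore(new double[] {1, 1},
                                    new boolean[] {true, false}));
        // Boost the matched term 3x and the same doc scores 0.75
        System.out.println(topScore(new double[] {3, 1},
                                    new boolean[] {true, false}));
    }
}
```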

More complex normalization schemes would go further and allow the tf's
and/or idf's to play a role in the determination of the top score -- I
didn't specify details here and am not sure how good a thing that would
be to do.  So, for now, let's just consider the properties of the simple
boost-weighted-query-term percentage normalization.

Hoss's example could be interpreted as single-term phrases "Doug
Cutting" and "Chris Hostetter", or as two-term BooleanQuery's.
Considering both of these cases illustrates the absolute-statement
properties and limitations of the proposed normalization.

If single-term PhraseQuery's, then the top score will always be 1.0
assuming the phrase matches (while the other results have arbitrary
fractional scores based on the tfidf ratios as today).  If the queries
are BooleanQuery's with no boosts, then the top score would be 1.0 or
0.5 depending on whether 1 or two terms were matched.  This is
meaningful.

In Lucene today, the top score is not meaningful.  It will always be 1.0
if the highest intrinsic score is >= 1.0.  I believe this could happen,
for example, in a two-term BooleanQuery that matches only one term (if
the tf on the matched document for that term is high enough).

So, to be concrete, a score of 1.0 with the proposed normalization
scheme would mean that all query terms are matched, while today a score
of 1.0 doesn't really tell you anything.  Certain absolute statements
can therefore be made with the new scheme.  This makes the
absolute-threshold monitored search application possible, along with the
segregating and filtering applications I've previously mentioned (call
out good results and filter out bad results by using absolute
thresholds).

These analyses are simplified by using only BooleanQuery's, but I
believe the properties carry over generally.

Doug also asked about research results.  I don't know of published
research on this topic, but I can again repeat an experience from
InQuira.  We found that end users benefited from a search experience
where good results were called out and bad results were downplayed or
filtered out.  And we managed to achieve this with absolute thresholding
through careful normalization (of a much more complex scoring
mechanism).  To get a better intuitive feel for this, think about how you
react to a search where all the results suck, but there is no visual
indication of this that is any different from a search that returns
great results.

Otis raised the patch I submitted for MultiSearcher.  This addresses a
related problem, in that the current MultiSearcher does not rank results
equivalently to a single unified index -- specifically it fails Daniel
Naber's test case.  However, this is just a simple bug whose fix doesn't
require the new normalization.  I submitted a patch to fix that bug,
along with a caveat that I'm not sure the patch is complete, or even
consistent with the intentions of the author of this mechanism.

I'm glad to see this topic is generating some interest, and apologize if
anything I've said comes across as overly abrasive.  I use and really
like Lucene.  I put a lot of focus on creating a great experience for
the end user, and so am perhaps more concerned about quality of results
and certain UI aspects than most other users.

Chuck

   -Original Message-
   From: Doug Cutting [mailto:[EMAIL 

RE: A question about scoring function in Lucene

2004-12-15 Thread Chuck Williams
Nhan,

You are correct that dropping the document norm does cause Lucene's scoring 
model to deviate from the pure vector space model.  However, including norm_d 
would cause other problems -- e.g., with short queries, as are typical in 
reality, the resulting scores with norm_d would all be extremely small.  You 
are also correct that since norm_q is invariant, it does not affect relevance 
ranking.  Norm_q is simply part of the normalization of final scores.  There 
are many different formulas for scoring and relevance ranking in IR.  All of 
these have some intuitive justification, but in the end can only be evaluated 
empirically.  There is no correct formula.

I believe the biggest problem with Lucene's approach relative to the pure 
vector space model is that Lucene does not properly normalize.  The pure vector 
space model implements a cosine in the strictly positive sector of the 
coordinate space.  This is guaranteed intrinsically to be between 0 and 1, and 
produces scores that can be compared across distinct queries (i.e., 0.8 means 
something about the result quality independent of the query).

Lucene does not have this property.  Its formula produces scores of arbitrary 
magnitude depending on the query.  The results cannot be compared meaningfully 
across queries; i.e., 0.8 means nothing intrinsically.  To keep final scores 
between 0 and 1, Lucene introduces an ad hoc query-dependent final 
normalization in Hits:  viz., it divides all scores by the highest score if the 
highest score happens to be greater than 1.  This makes it impossible for an 
application to properly inform its users about the quality of the results, to 
cut off bad results, etc.  Applications may do that, but in fact what they are 
doing is random, not what they think they are doing.
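For illustration, here is a minimal sketch of what a meaningful cutoff looks like today: since absolute Lucene scores are not comparable across queries, one can threshold on the ratio to the top score in the same result set. This assumes the Lucene 1.4-era Hits API; the 0.5 cutoff and the "title" field are arbitrary choices, not recommendations.

```java
import java.io.IOException;

import org.apache.lucene.search.Hits;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;

public class RelativeThreshold {
    // Print only hits scoring at least half as well as the best hit.
    public static void printGoodHits(Searcher searcher, Query query)
            throws IOException {
        Hits hits = searcher.search(query);
        if (hits.length() == 0) return;
        float top = hits.score(0);  // highest score in this result set
        for (int i = 0; i < hits.length(); i++) {
            if (hits.score(i) / top < 0.5f) break;  // relative, not absolute, cutoff
            System.out.println(hits.doc(i).get("title") + "  " + hits.score(i));
        }
    }
}
```

Note this only ranks results relative to each other; it still says nothing about whether the whole result set is any good, which is the point of the proposal above.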

I've proposed a fix for this -- there was a long thread on Lucene-dev.  It is 
possible to revise Lucene's scoring to keep its efficiency, keep its current 
per-query relevance ranking, and yet intrinsically normalize its scores so that 
they are meaningful across queries.  I posted a fairly detailed spec of how to 
do this in the Lucene-dev thread.  I'm hoping to have time to build it and 
submit it as a proposed update to Lucene, but it is a large effort that would 
involve changing just about every scoring class in Lucene.  I'm not sure it 
would be incorporated even if I did it as that would take considerable work 
from a developer.  There doesn't seem to be much concern about these various 
scoring and relevancy ranking issues among the general Lucene community.

Chuck

   -Original Message-
   From: Nhan Nguyen Dang [mailto:[EMAIL PROTECTED]
   Sent: Wednesday, December 15, 2004 1:18 AM
   To: Lucene Users List
   Subject: RE: A question about scoring function in Lucene
   
   Thank for your answer,
   In the Lucene scoring function, they use only norm_q,
   but for one query, norm_q is the same for all
   documents.
   So norm_q actually does not affect the score.
   But norm_d is different, each document has a different
   norm_d; it affects the score of document d for query q.
   If you drop it, the score information is not correct
   anymore, or it is not the vector space model anymore.  Could
   you explain it a little bit?
   
   I think that it's expensive to compute in incremental
   indexing because when one document is added, the idf of
   each term changes. But dropping it is not a good choice.
   
   What is the role of norm_d_t ?
   Nhan.
   
   --- Chuck Williams [EMAIL PROTECTED] wrote:
   
Nhan,
   
Re.  your two differences:
   
1 is not a difference.  Norm_d and Norm_q are both
independent of t, so summing over t has no effect on
them.  I.e., Norm_d * Norm_q is constant wrt the
summation, so it doesn't matter if the sum is over
just the numerator or over the entire fraction, the
result is the same.
   
2 is a difference.  Lucene uses Norm_q instead of
Norm_d because Norm_d is too expensive to compute,
especially in the presence of incremental indexing.
E.g., adding or deleting any document changes the
idf's, so if Norm_d was used it would have to be
recomputed for ALL documents.  This is not feasible.
   
Another point you did not mention is that the idf
term is squared (in both of your formulas).  Salton,
the originator of the vector space model, dropped
one idf factor from his formula as it improved
results empirically.  More recent theoretical
justifications of tf*idf provide intuitive
explanations of why idf should only be included
linearly.  tf is best thought of as the real vector
entry, while idf is a weighting term on the
components of the inner product.  E.g., see the
excellent paper by Robertson, "Understanding inverse
document frequency: on theoretical arguments for
IDF", available here:
http://www.emeraldinsight.com/rpsv/cgi-bin/emft.pl
if you sign up for an eval.
   
It's easy to correct for idf^2

RE: A question about scoring function in Lucene

2004-12-14 Thread Chuck Williams
Nhan,

Re.  your two differences:

1 is not a difference.  Norm_d and Norm_q are both independent of t, so summing 
over t has no effect on them.  I.e., Norm_d * Norm_q is constant wrt the 
summation, so it doesn't matter if the sum is over just the numerator or over 
the entire fraction, the result is the same.

2 is a difference.  Lucene uses Norm_q instead of Norm_d because Norm_d is too 
expensive to compute, especially in the presence of incremental indexing.  
E.g., adding or deleting any document changes the idf's, so if Norm_d was used 
it would have to be recomputed for ALL documents.  This is not feasible.

Another point you did not mention is that the idf term is squared (in both of 
your formulas).  Salton, the originator of the vector space model, dropped one 
idf factor from his formula as it improved results empirically.  More recent 
theoretical justifications of tf*idf provide intuitive explanations of why idf 
should only be included linearly.  tf is best thought of as the real vector 
entry, while idf is a weighting term on the components of the inner product.  
E.g., see the excellent paper by Robertson, "Understanding inverse document 
frequency: on theoretical arguments for IDF", available here:  
http://www.emeraldinsight.com/rpsv/cgi-bin/emft.pl if you sign up for an eval.

It's easy to correct for idf^2 by using a custom Similarity that takes a 
final square root.
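A minimal sketch of that correction, assuming the Lucene 1.4-era Similarity API (untested, and the class name is made up):

```java
import org.apache.lucene.search.DefaultSimilarity;

public class SqrtIdfSimilarity extends DefaultSimilarity {
    // idf() contributes on both the query and document side of the
    // inner product, so a square root here yields a net linear idf factor.
    public float idf(int docFreq, int numDocs) {
        return (float) Math.sqrt(super.idf(docFreq, numDocs));
    }
}
```

It would be installed with something like `searcher.setSimilarity(new SqrtIdfSimilarity())` before searching.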

Chuck

   -Original Message-
   From: Vikas Gupta [mailto:[EMAIL PROTECTED]
   Sent: Tuesday, December 14, 2004 9:32 PM
   To: Lucene Users List
   Subject: Re: A question about scoring function in Lucene
   
   Lucene uses the vector space model. To understand that:
   
   -Read section 2.1 of Space optimizations for Total Ranking paper
   (Linked
   here http://lucene.sourceforge.net/publications.html)
   -Read section 6 to 6.4 of
   http://www.csee.umbc.edu/cadip/readings/IR.report.120600.book.pdf
   -Read section 1 of
   http://www.cs.utexas.edu/users/inderjit/courses/dm2004/lecture5.ps
   
   Vikas
   
   On Tue, 14 Dec 2004, Nhan Nguyen Dang wrote:
   
Hi all,
Lucene scores a document based on the correlation between
the query q and the document d:
(this is the raw function; I don't pay attention to the
boost_t, coord_q_d factors)
   
score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t
/ norm_d_t)  (*)
   
Could anybody explain it in detail ? Or are there any
papers, documents about this function ? Because:
   
I have also read the book: Modern Information
Retrieval, author: Ricardo Baeza-Yates and Berthier
Ribeiro-Neto, Addison Wesley (Hope you have read it
too). In page 27, they also suggest a scoring funtion
for vector model based on the correlation between
query q and document d as follow (I use different
symbol):
   
                 sum_t( weight_t_d * weight_t_q )
score_d(d, q) = ---------------------------------   (**)
                        norm_d * norm_q
   
where weight_t_d = tf_d * idf_t
  weight_t_q = tf_q * idf_t
  norm_d = sqrt( sum_t( (tf_d * idf_t)^2 ) )
  norm_q = sqrt( sum_t( (tf_q * idf_t)^2 ) )
   
(**):
                 sum_t( tf_q*idf_t * tf_d*idf_t )
score_d(d, q) = ---------------------------------   (***)
                        norm_d * norm_q
   
The two function, (*) and (***), have 2 differences:
1. in (***), the sum_t is just for the numerator but
in the (*), the sum_t is for everything. So, with
norm_q = sqrt(sum_t((tf_q*idf_t)^2)); sum_t is
calculated twice. Is this right? please explain.
   
2. No factor that define norms of the document: norm_d
in the function (*). Can you explain this. what is the
role of factor norm_d_t ?
   
One more question: could anybody give me documents,
papers that explain this function in detail. so when I
apply Lucene for my system, I can adapt the document,
and the field so that I still receive the correct
scoring information from Lucene .
   
Best regard,
Thanks every body,
   
=
Đặng Nhân
   
   -
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: A simple Query Language

2004-12-10 Thread Chuck Williams
You could support only terms with no operators at all, which will work
in most search engines (except those that require combining operators).
Using just terms and phrases embedded in quotes is pretty universal.
After that, you might want to add +/- required/prohibited restrictions,
which many engines support.  After that, I think you're getting pretty
specific.  Lucene supports all of these and many more.

Chuck

   -Original Message-
   From: Dongling Ding [mailto:[EMAIL PROTECTED]
   Sent: Friday, December 10, 2004 5:08 PM
   To: Lucene Users List
   Subject: A simple Query Language
   
   Hi,
   
   
   
   I am going to implement a search service and plan to use Lucene. Is
   there any simple query language that is independent of any
particular
   search engine out there?
   
   
   
   Thanks
   
   
   
   
   
   Dongling
   
   
   
   
   
  

   
   If you have received this e-mail in error, please delete it and
notify
   the sender as soon as possible. The contents of this e-mail may be
   confidential and the unauthorized use, copying, or dissemination of
it
   and any attachments to it, is prohibited.
   
   Internet communications are not secure and Hyperion does not,
therefore,
   accept legal responsibility for the contents of this message nor for
any
   damage caused by viruses.  The views expressed here do not
necessarily
   represent those of Hyperion.
   
   For more information about Hyperion, please visit our Web site at
   www.hyperion.com
   





RE: Coordination value

2004-12-09 Thread Chuck Williams
There is an easier way.  You should use a custom Similarity, which
allows you to define your own coord() method.  Look at DefaultSimilarity
(which specializes Similarity).

I'd suggest analyzing your scores first with explain() to decide what
you really want to tweak.  Just a guess, but your issue might be that
your idf()'s are dominating the score computation.  I had this problem
and change the default idf() to take a final square root, since Lucene
squares that contribution (which is one of its few areas that is
generally not considered best practice).  I also boost the base of the
logarithms on both tf and idf to weight those factors lower.
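As an illustration of the coord() override mentioned above (Lucene 1.4-era API; the class name and the exponent are arbitrary choices, not recommended values):

```java
import org.apache.lucene.search.DefaultSimilarity;

public class StrongCoordSimilarity extends DefaultSimilarity {
    // The default coord() is overlap / maxOverlap; raising it to a power
    // widens the gap between documents matching different numbers of
    // query terms, so term coverage tends to dominate tf/idf differences.
    public float coord(int overlap, int maxOverlap) {
        return (float) Math.pow((double) overlap / maxOverlap, 4.0);
    }
}
```

Use explain() before and after the change to verify the coord contribution is actually what was skewing the ordering.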

Good luck,

Chuck

   -Original Message-
   From: Jason Haruska [mailto:[EMAIL PROTECTED]
   Sent: Thursday, December 09, 2004 1:36 PM
   To: Lucene Users List
   Subject: Coordination value
   
   I would like to adjust the score lucene is returning to use the
   coordination component more. For example, I have a BooleanQuery
   containing three TermQueries. I would like to adjust the score so
that
   documents containing all three terms appear first, followed by docs
   that contain only two of the terms, followed by documents that
contain
   only one of the terms.
   
   I understand that the coordination is a component of the overall
   document score currently, but I'd like to make it more absolute. I
was
   wondering if someone on the list has done something similar.
   
   I have implemented a hack that works by adding a function to the
   BooleanWeight class but it is very slow. I believe it is inefficient
   because it uses the Explanation class to get the coordination value.
   There must be an easier way that I'm missing.
   
  
-
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]





RE: Lucene Vs Ixiasoft

2004-12-08 Thread Chuck Williams
Lucene contains a complete set of Boolean query operators, and it uses
the vector space model to determine scores for relevance ranking.  It's
fast.  It works.

Chuck

   -Original Message-
   From: John Wang [mailto:[EMAIL PROTECTED]
   Sent: Wednesday, December 08, 2004 7:13 PM
   To: Lucene Users List; Nicolas Maisonneuve
   Subject: Re: Lucene Vs Ixiasoft
   
   I thought Lucene implements the Boolean model.
   
   -John
   
   
   On Thu, 9 Dec 2004 00:19:21 +0100, Nicolas Maisonneuve
   [EMAIL PROTECTED] wrote:
hi,
think first of the relevance of the model in this 2 search engine
for
XML document retrieval.
   
Lucene is a classic fulltext search engine using the vector space
model. This model is efficient for indexing unstructured documents
(like plain text files) and is not made for structured documents like
XML.
There is an XML demo in the Lucene sandbox, but it's not really very
efficient because it doesn't take advantage of the document structure
in the indexing and the ranking model, so it loses semantic
information and relevance.
   
i don't know Ixiasoft, check the information to see how it index
and
rank XML document.
   
nicolas
   
On Wed, 8 Dec 2004 14:20:45 -0500, Praveen Peddi
   
   
[EMAIL PROTECTED] wrote:
 Does anyone know about the Ixiasoft server? It's an XML repository/search
   engine. If anyone knows about it, does he/she also know how it
   compares to Lucene? Which is faster?

 Praveen
 **
 Praveen Peddi
 Sr Software Engg, Context Media, Inc.
 email:[EMAIL PROTECTED]
 Tel:  401.854.3475
 Fax:  401.861.3596
 web: http://www.contextmedia.com
 **
 Context Media- The Leader in Enterprise Content Integration


   
   
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail:
[EMAIL PROTECTED]
   
   
   
  
-
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]





RE: Sorting in Lucene

2004-12-07 Thread Chuck Williams
Since it's untokenized, are you searching with the exact string stored
in the field?

Chuck

   -Original Message-
   From: Ramon Aseniero [mailto:[EMAIL PROTECTED]
   Sent: Tuesday, December 07, 2004 3:29 PM
   To: 'Lucene Users List'; 'Chris Fraschetti'
   Subject: RE: Sorting in Lucene
   
   I also tried searching the said field on LIMO and I don't get a
match.
   
   Thanks,
   Ramon
   
   -Original Message-
   From: Ramon Aseniero [mailto:[EMAIL PROTECTED]
   Sent: Tuesday, December 07, 2004 3:20 PM
   To: 'Lucene Users List'; 'Chris Fraschetti'
   Subject: RE: Sorting in Lucene
   
   Hi,
   
   I use LIMO to look into my index. Limo tells me that the field is
   untokenized but is indexed.
   
   Is it possible to search on untokenized field?
   
   Thanks,
   Ramon
   
   -Original Message-
   From: Chris Fraschetti [mailto:[EMAIL PROTECTED]
   Sent: Tuesday, December 07, 2004 3:14 PM
   To: Lucene Users List
   Subject: Re: Sorting in Lucene
   
   I would try 'luke' to look at your index and use its search
   functionality to make sure it's not your code that is the problem, as
   well as to ensure your document is appearing in the index as you
   intend it. It's been a lifesaver for me.
   
   http://www.getopt.org/luke/
   
   
   On Tue, 7 Dec 2004 15:02:26 -0800, Ramon Aseniero
   [EMAIL PROTECTED] wrote:
Hi All,
   
Any idea why a Keyword field is not searchable? On my index I have
a
   field
of type Keyword but I could not somehow search on the field.
   
Thanks in advance.
   
Ramon
   
--
No virus found in this outgoing message.
Checked by AVG Anti-Virus.
Version: 7.0.289 / Virus Database: 265.4.7 - Release Date:
12/7/2004
   
   
   
   
   --
   ___
   Chris Fraschetti
   e [EMAIL PROTECTED]
   
  
-
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]
   
   
   
   
   
   
  
-
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]
   
   
   
   
   
   
  
-
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]





RE: Sorting in Lucene

2004-12-07 Thread Chuck Williams
Ramon,

Field.Keyword is definitely searchable.  I use them.  I think I use
every combination of tokenized/untokenized, index/unindexed, and
stored/unstored.  They all work.

This seems unlikely given that you tried with Luke, but do you perhaps
have an analyzer applied to the query so that the query string is
transformed before it is applied to the index?  I'd suggest printing the
query after you parse it.  Query's have a good toString() method.
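A sketch of the difference (the field name and value here are made up): an untokenized Field.Keyword is matched by a hand-built TermQuery, which bypasses analysis entirely, while QueryParser first runs the text through the analyzer and so may never produce the exact stored term.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class KeywordQueryExample {
    public static Query exactKeywordQuery() {
        // Matches only documents whose "type" keyword field holds
        // exactly "PressRelease" -- no tokenizing, no lower-casing.
        return new TermQuery(new Term("type", "PressRelease"));
    }
}
```

By contrast, parsing `type:PressRelease` with a lower-casing or stemming analyzer produces a different term and finds nothing.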

Chuck

   -Original Message-
   From: Ramon Aseniero [mailto:[EMAIL PROTECTED]
   Sent: Tuesday, December 07, 2004 4:14 PM
   To: 'Lucene Users List'
   Subject: RE: Sorting in Lucene
   
   Hi Chuck,
   
   Yes I tried to search with the exact string stored on the index but
I
   don't
   get a match. I tried the search using LIMO and LUKE.
   
   It seems like untokenized field are not searchable.
   
   Thanks,
   Ramon
   
   -Original Message-
   From: Chuck Williams [mailto:[EMAIL PROTECTED]
   Sent: Tuesday, December 07, 2004 4:04 PM
   To: Lucene Users List
   Subject: RE: Sorting in Lucene
   
   Since it's untokenized, are you searching with the exact string
stored
   in the field?
   
   Chuck
   
  -Original Message-
  From: Ramon Aseniero [mailto:[EMAIL PROTECTED]
  Sent: Tuesday, December 07, 2004 3:29 PM
  To: 'Lucene Users List'; 'Chris Fraschetti'
  Subject: RE: Sorting in Lucene
 
  I also tried searching the said field on LIMO and I don't get a
   match.
 
  Thanks,
  Ramon
 
  -Original Message-
  From: Ramon Aseniero [mailto:[EMAIL PROTECTED]
  Sent: Tuesday, December 07, 2004 3:20 PM
  To: 'Lucene Users List'; 'Chris Fraschetti'
  Subject: RE: Sorting in Lucene
 
  Hi,
 
  I use LIMO to look into my index. Limo tells me that the field
is
  untokenized but is indexed.
 
  Is it possible to search on untokenized field?
 
  Thanks,
  Ramon
 
  -Original Message-
  From: Chris Fraschetti [mailto:[EMAIL PROTECTED]
  Sent: Tuesday, December 07, 2004 3:14 PM
  To: Lucene Users List
  Subject: Re: Sorting in Lucene
 
  I would try 'luke' to look at your index and use its search
  functionality to make sure it's not your code that is the problem, as
  well as to ensure your document is appearing in the index as you
  intend it. It's been a lifesaver for me.
 
  http://www.getopt.org/luke/
 
 
  On Tue, 7 Dec 2004 15:02:26 -0800, Ramon Aseniero
  [EMAIL PROTECTED] wrote:
   Hi All,
  
   Any idea why a Keyword field is not searchable? On my index I
have
   a
  field
   of type Keyword but I could not somehow search on the field.
  
   Thanks in advance.
  
   Ramon
  
  
  
 
 
  --
  ___
  Chris Fraschetti
  e [EMAIL PROTECTED]
 
 
  
-
  To unsubscribe, e-mail:
[EMAIL PROTECTED]
  For additional commands, e-mail:
[EMAIL PROTECTED]
 
 
 
 
 
 
 
  
-
  To unsubscribe, e-mail:
[EMAIL PROTECTED]
  For additional commands, e-mail:
[EMAIL PROTECTED]
 
 
 
 
 
 
 
  
-
  To unsubscribe, e-mail:
[EMAIL PROTECTED]
  For additional commands, e-mail:
[EMAIL PROTECTED]
   
   
  
-
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]
   
   
   

Recommended values for mergeFactor, minMergeDocs, maxMergeDocs

2004-12-03 Thread Chuck Williams
I'm wondering what values of mergeFactor, minMergeDocs and maxMergeDocs
people have found to yield the best performance for different
configurations.  Is there a repository of this information anywhere?

 

I've got about 30k documents and have 3 indexing scenarios:

1.   Full indexing and optimize

2.   Incremental indexing and optimize

3.   Parallel incremental indexing without optimize

 

Search performance is critical.  For both cases 1 and 2, I'd like the
fastest possible indexing time.  For case 3, I'd like minimal pauses and
no noticeable degradation in search performance.

 

Based on reading the code (including the javadocs comments), I'm
thinking of values along these lines:

 

mergeFactor:  1000 during Full indexing, and during optimize (for both
cases 1 and 2); 10 during incremental indexing (cases 2 and 3)

minMergeDocs:  1000 during Full indexing, 10 during incremental indexing

maxMergeDocs:  Integer.MAX_VALUE during full indexing, 1000 during
incremental indexing

 

Do these values seem reasonable?  Are there better settings before I
start experimenting?

 

Since mergeFactor is used in both addDocument() and optimize(), I'm
thinking of using two different values in case 2:  10 during the
incremental indexing, and then 1000 during the optimize.  Is changing
the value like this going to cause a problem?
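A sketch of switching the settings between phases (in Lucene 1.4 these are public fields on IndexWriter; the path and values are examples only, per the question above):

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class TwoPhaseIndexing {
    public static void index(Document[] docs) throws IOException {
        IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), false);
        writer.mergeFactor = 10;    // small while adding incrementally
        writer.minMergeDocs = 10;
        for (int i = 0; i < docs.length; i++)
            writer.addDocument(docs[i]);
        writer.mergeFactor = 1000;  // large just before the final merge
        writer.optimize();
        writer.close();
    }
}
```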


Thanks for any advice,

 

Chuck

 

 



RE: Search multiple Fields

2004-12-02 Thread Chuck Williams
If you want this to be efficient in your application, I'd suggest
integrating at a lower level.  E.g., take a look at TermScorer.explain()
to see how it determines whether or not a term matches in a field of
document.

Another approach might be to specialize BooleanQuery to keep track of
which clauses matched.
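A debugging sketch (Lucene 1.4-era API, slow and for diagnostics only) that walks the Explanation tree for one hit; the field:term matches show up in the leaf descriptions:

```java
import org.apache.lucene.search.Explanation;

public class ExplainDump {
    // Recursively print an Explanation tree; leaf descriptions name
    // the matched field and term.
    public static void dump(Explanation e, String indent) {
        System.out.println(indent + e.getValue() + "  " + e.getDescription());
        Explanation[] details = e.getDetails();
        if (details == null) return;
        for (int i = 0; i < details.length; i++)
            dump(details[i], indent + "  ");
    }
    // Usage (assumed names): dump(searcher.explain(query, hits.id(n)), "");
}
```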

Chuck

   -Original Message-
   From: Erik Hatcher [mailto:[EMAIL PROTECTED]
   Sent: Thursday, December 02, 2004 12:13 PM
   To: Lucene Users List
   Subject: Re: Search multiple Fields
   
   On Dec 2, 2004, at 11:43 AM, Eric Louvard wrote:
I'm searching, for example
   
title:world OR contents:world OR author:world
   
Is it possible to know where (in which Field) have Lucene found
'world' in each Document,
without making 3 queries ?
   
   Not in a straightforward way, but you can dig through the
Explanation
   returned from IndexSearcher.explain() to see what factors are
involved
   in the score, which does include info on what fields/terms were
   matched.
   
   Erik
   
   
  
-
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]





RE: boosting challenge

2004-11-29 Thread Chuck Williams
Try the explain() capability to see what factors are influencing the
order of your results.  Probably these other factors are overwhelming
your boost.  I had similar problems and resolved them by tweaking these
other contributions, especially idf.  You can do that in a custom
Similarity.

Chuck

   -Original Message-
   From: Frank Morton [mailto:[EMAIL PROTECTED]
   Sent: Monday, November 29, 2004 12:49 PM
   To: Lucene Users List
   Subject: Re: boosting challenge
   
   Thanks for the response.  Using 4.0 did not work either.
   
   Additionally, I have also tried Field.setBoost(4.0) on the name
   field. That didn't work either.
   
   Still perplexed... I assume people are using boosting with 1.4
   successfully.
   
   
   On Nov 29, 2004, at 3:36 PM, Otis Gospodnetic wrote:
   
Try 4.0 instead of 4.  That may be correct syntax (don't have
QueryParser source to check), because the code takes boosts as
float
type values.
   
Otis
   
--- Frank Morton [EMAIL PROTECTED] wrote:
   
I have an index of restaurants with two fields. The name of the
restaurant and a description.
   
I would like to search for the word bob in both fields, but if
it
occurs in the name, it would score higher. So, if Bob Evans
is
the
name of the restaurant, but other restaurants refer to Bob in
the
description, the restaurant Bob Evans would score highest, but
the
others would also match the query.
   
I thought you could boost the term with a query like:
   
name:bob^4 description:bob
   
and it would boost the word bob if found in the name property,
but
this is not working for me. I get  the exact same results using
the
above query and a simple bob query.
   
I am using lucene-1.4-final.jar.
   
I am using the PorterStemAnalyzer
   
Am I missing something. Lucene seems very capable,  otherwise.
   
Thanks.
   
   
   
-
To unsubscribe, e-mail:
[EMAIL PROTECTED]
For additional commands, e-mail:
[EMAIL PROTECTED]
   
   
   
   
   
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail:
[EMAIL PROTECTED]
   
   
   
  
-
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]





RE: modifying existing index

2004-11-24 Thread Chuck Williams
I haven't tried it but believe this should work:

IndexReader reader;
void delete(long id) {
    reader.delete(new Term("id", Long.toString(id)));
}

This also has the benefit that it does binary search rather than
sequential search.

You will want to pad your id's with leading zeroes if you are going to do
incremental indexing (both when storing them and when looking them up).
Sorting is by lexicographic order, not numerical order, and incremental
indexing is much faster if the id's are kept sorted (as is done in
IndexHTML).
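A minimal padding helper (the width of 12 is an arbitrary choice; pick anything wider than your largest id):

```java
public class IdPad {
    public static final int WIDTH = 12;

    // Left-pad with zeroes so lexicographic term order equals numeric order.
    public static String pad(long id) {
        StringBuffer buf = new StringBuffer(Long.toString(id));
        while (buf.length() < WIDTH) buf.insert(0, '0');
        return buf.toString();
    }
}
```

Without padding, "9" sorts after "10" lexicographically; with padding, term order matches numeric order.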

Chuck


   -Original Message-
   From: Santosh [mailto:[EMAIL PROTECTED]
   Sent: Wednesday, November 24, 2004 9:54 AM
   To: Lucene Users List
   Subject: Re: modifying existing index
   
   I am able to delete now the Index using the following
   
   if (indexDir.exists()) {

       IndexReader reader = IndexReader.open(indexDir);

       uidIter = reader.terms(new Term("id", ""));

       while (uidIter.term() != null && "id".equals(uidIter.term().field())) {
           reader.delete(uidIter.term());
           uidIter.next();
       }

       reader.close();
   }
   
   where "id" is the keyword field. But here also all the documents are
   deleted. How can I modify my code and delete a particular document
   with a given id?
   
   
   
   
   
   I am creating the index in the following way:

   Document doc = new Document();
   doc.add(Field.Text("text", text));
   doc.add(Field.Keyword("id", Long.toString(id)));
   doc.add(Field.Keyword("title", title));
   doc.add(Field.Keyword("keywords", keywords));
   doc.add(Field.Keyword("type", type));
   writer.addDocument(doc);
   
   
   
   
   
   
   
   
   
   - Original Message -
   From: Chuck Williams [EMAIL PROTECTED]
   To: Lucene Users List [EMAIL PROTECTED]
   Sent: Wednesday, November 24, 2004 1:06 PM
   Subject: RE: modifying existing index
   
   
   A good way to do this is to add a keyword field with whatever unique
id
   you have for the document.  Then you can delete the term containing
a
   unique id to delete the document from the index (look at
   IndexReader.delete(Term)).  You can look at the demo class IndexHTML
to
   see how it does incremental indexing for an example.
   
   Chuck
   
  -Original Message- From: Santosh
   [mailto:[EMAIL PROTECTED] Sent: Tuesday, November 23, 2004
   11:34
   PM To: Lucene Users List Subject: Re: modifying existing index 
I
   have
   gon through IndexReader , I got method : delete(int docNum)
,
   but
   from where I will get document number? Is  this predifined? or
   we have to give a number prior  to indexing? - Original
   Message - From: Luke Francl [EMAIL PROTECTED] To:
   Lucene
   Users List [EMAIL PROTECTED] Sent: Wednesday,
November
   24,
   2004 1:26 AM Subject: Re: modifying existing indexOn Tue,
   2004-11-23 at 13:59, Santosh wrote:   I am using lucene for
indexing,
   when I am creating Index the docuemnts are added. but when I want
to
   modify the single existing document
   and reIndex again, it is taking as new document and adding one more
   time, so that I am getting same document twice in the results.  
To
   overcome this I am deleting existing Index and again
   recreating whole Index. but is it possibe to index  the modified
   document
   again and overwrite existing document without deleting and
recreation.
   can
   I do this? If
   so how?   You do not need to recreate the whole index. Just
mark
   the
   document as  deleted using the IndexReader and then add it again
with
   the
IndexWriter. Remember to close your IndexReader and IndexWriter
   after  doing this.   The deleted document will be removed the
next
   time you optimize
   your  index.   Luke Francl   
  
- 
   To
   unsubscribe, e-mail: [EMAIL PROTECTED] 
For
   additional commands, e-mail:
   [EMAIL PROTECTED]   
  
-
   To
   unsubscribe, e-mail: [EMAIL PROTECTED] For
   additional commands, e-mail: [EMAIL PROTECTED]
   
   
  
-
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]
   
   
   
  
-
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]





RE: URGENT: Help indexing large document set

2004-11-24 Thread Chuck Williams
Does keyIter return the keys in sorted order?  This should reduce seeks,
especially if the keys are dense.

Also, you should be able to use localReader.delete(term) instead of
iterating over the docs (of which I presume there is only one doc since
keys are unique).  This won't improve performance as
IndexReader.delete(Term) does exactly what your code does, but it will
be cleaner.
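
A sketch of the simplified loop (Lucene 1.4-era API; "key" as the field name, and the path/keyIter variables, are assumptions carried over from the code quoted below):

```java
// Sketch only -- not a drop-in replacement; assumes the surrounding
// variables from the quoted code and that "key" is the keyword field name.
IndexReader localReader = IndexReader.open(path);
try {
    while (keyIter.hasNext()) {
        String key = (String) keyIter.next();
        // delete(Term) removes every document containing the term;
        // with unique keys that is at most one document per call
        localReader.delete(new Term("key", key));
    }
} finally {
    localReader.close();
}
```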

A linear slowdown with number of docs doesn't make sense, so something
else must be wrong.  I'm not sure what the default buffer size is (it
appears it used to be 128 but is dynamic now I think).  You might find
the slowdown stops after a certain point, especially if you increase
your batch size.

Chuck

   -Original Message-
   From: John Wang [mailto:[EMAIL PROTECTED]
   Sent: Wednesday, November 24, 2004 12:21 PM
   To: Lucene Users List
   Subject: Re: URGENT: Help indexing large document set
   
   Thanks Paul!
   
   Using your suggestion, I have changed the update check code to use
   only the indexReader:
   
   try {
 localReader = IndexReader.open(path);
   
 while (keyIter.hasNext()) {
   key = (String) keyIter.next();
   term = new Term(key, key);
   TermDocs tDocs = localReader.termDocs(term);
   if (tDocs != null) {
 try {
   while (tDocs.next()) {
 localReader.delete(tDocs.doc());
   }
 } finally {
   tDocs.close();
 }
   }
 }
   } finally {
   
 if (localReader != null) {
   localReader.close();
 }
   
   }
   
   
   Unfortunately it didn't seem to make any dramatic difference.
   
   I also see the CPU is only 30-50% busy, so I am guessing it's
spending
   a lot of time in IO. Anyway of making the CPU work harder?
   
   Is batch size of 500 too small for 1 million documents?
   
   Currently I am seeing a linear speed degradation of 0.3 milliseconds
   per document.
   
   Thanks
   
   -John
   
   
   On Wed, 24 Nov 2004 09:05:39 +0100, Paul Elschot
   [EMAIL PROTECTED] wrote:
On Wednesday 24 November 2004 00:37, John Wang wrote:
   
   
 Hi:

I am trying to index 1M documents, with batches of 500
documents.

Each document has an unique text key, which is added as a
 Field.KeyWord(name,value).

For each batch of 500, I need to make sure I am not adding a
 document with a key that is already in the current index.

   To do this, I am calling IndexSearcher.docFreq for each
document
   and
 delete the document currently in the index with the same key:

while (keyIter.hasNext()) {
 String objectID = (String) keyIter.next();
 term = new Term(key, objectID);
 int count = localSearcher.docFreq(term);
   
To speed this up a bit make sure that the iterator gives
the terms in sorted order. I'd use an index reader instead
of a searcher, but that will probably not make a difference.
   
Adding the documents can be done with multiple threads.
Last time I checked that, there was a moderate speed up
using three threads instead of one on a single CPU machine.
Tuning the values of minMergeDocs and maxMergeDocs
may also help to increase performance of adding documents.
   
Regards,
Paul Elschot
   
   
-
   
   
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail:
[EMAIL PROTECTED]
   
   
   
  
-
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: URGENT: Help indexing large document set

2004-11-23 Thread Chuck Williams
Are you sure you have a performance problem with
TermInfosReader.get(Term)?  It looks to me like it scans sequentially
only within a small buffer window (of size
SegmentTermEnum.indexInterval) and that it uses binary search otherwise.
See TermInfosReader.getIndexOffset(Term).
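
The two-level lookup described here is easy to illustrate with a self-contained toy (plain Java; this shows the idea only, not Lucene's actual code):

```java
public class TermLookup {
    // Binary-search a sparse index of every interval-th term, then scan
    // linearly only within that small window -- the same two-level scheme
    // TermInfosReader uses with SegmentTermEnum.indexInterval.
    public static int lookup(String[] sortedTerms, int interval, String target) {
        int lo = 0, hi = (sortedTerms.length - 1) / interval;
        while (lo < hi) {  // binary search over the sparse index entries
            int mid = (lo + hi + 1) / 2;
            if (sortedTerms[mid * interval].compareTo(target) <= 0) lo = mid;
            else hi = mid - 1;
        }
        int end = Math.min(sortedTerms.length, lo * interval + interval);
        for (int i = lo * interval; i < end; i++) {  // short linear scan
            if (sortedTerms[i].equals(target)) return i;
        }
        return -1;  // target not present
    }
}
```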

Chuck

   -Original Message-
   From: John Wang [mailto:[EMAIL PROTECTED]
   Sent: Tuesday, November 23, 2004 3:38 PM
   To: [EMAIL PROTECTED]
   Subject: URGENT: Help indexing large document set
   
   Hi:
   
  I am trying to index 1M documents, with batches of 500 documents.
   
  Each document has an unique text key, which is added as a
   Field.KeyWord(name,value).
   
  For each batch of 500, I need to make sure I am not adding a
   document with a key that is already in the current index.
   
 To do this, I am calling IndexSearcher.docFreq for each document
and
   delete the document currently in the index with the same key:
   
  while (keyIter.hasNext()) {
   String objectID = (String) keyIter.next();
   term = new Term(key, objectID);
   int count = localSearcher.docFreq(term);
   
   if (count != 0) {
   localReader.delete(term);
   }
 }
   
   Then I proceed with adding the documents.
   
   This turns out to be extremely expensive, I looked into the code and
I
   see in
   TermInfosReader.get(Term term) it is doing a linear look up for each
   term. So as the index grows, the above operation degrades at a
linear
   rate. So for each commit, we are doing a docFreq for 500 documents.
   
   I also tried to create a BooleanQuery composed of 500 TermQueries
and
   do 1 search for each batch, and the performance didn't get better.
And
   if the batch size increases to say 50,000, creating a BooleanQuery
   composed of 50,000 TermQuery instances may introduce huge memory
   costs.
   
   Is there a better way to do this?
   
   Can TermInfosReader.get(Term term) be optimized to do a binary
lookup
   instead of a linear walk? Of course that depends on whether the
terms
   are stored in sorted order, are they?
   
   This is very urgent, thanks in advance for all your help.
   
   -John
   
  
-
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]





RE: lucene Scorers

2004-11-23 Thread Chuck Williams
Hi Ken,

I'm glad our replies were helpful.  It sounds like you looked at the
code in MaxDisjunctionQuery, so you probably noticed that it also
implements skipTo().  Your suggestion sounds like a good thing to do.  I
thought about that when writing MaxDisjunctionQuery, but didn't need the
generality, and it does make the code more complex.  I think Lucene
needs one of these mechanisms in it, at least to solve the problems
associated with the current default use of BooleanQuery for multiple
field expansions.  Your proposal would generalize this to solve
additional cases where different accrual operators are appropriate.

You could write and submit the generalization, although there are no
guarantees anybody would do anything with it.  I didn't get anywhere in
my attempt to submit MaxDisjunctionQuery.  I think there is also a
serious problem in scoring with the current score normalization (it does
not provide meaningfully comparable scores across different searches,
which means that absolute score numbers like 0.8 have no intrinsic
meaning concerning how good a result is or is not).  When I finally get
back to tuning search in my app, that's the next one I'll try a
submission on.

Chuck

   -Original Message-
   From: Ken McCracken [mailto:[EMAIL PROTECTED]
   Sent: Tuesday, November 23, 2004 4:31 PM
   To: Lucene Users List
   Subject: Re: lucene Scorers
   
   Hi,
   
   Thanks the pointers in your replies.  Would it be possible to
include
   some sort of accrual scorer interface somewhere in the Lucene Query
   APIs?  This could be passed into a query similar to
   MaxDisjunctionQuery; and combine the sum, max, tieBreaker, etc.,
   according to the implementor's discretion, to compute the overall
   score for a document.
   
   -Ken
   
   On Sat, 13 Nov 2004 12:07:05 +0100, Paul Elschot
   [EMAIL PROTECTED] wrote:
On Friday 12 November 2004 22:56, Chuck Williams wrote:
   
   
 I had a similar need and wrote MaxDisjunctionQuery and
 MaxDisjunctionScorer.  Unfortunately these are not available as
a
   patch
 but I've included the original message below that has the code
   (modulo
 line breaks added by simple text email format).

 This code is functional -- I use it in my app.  It is optimized
for
   its
 stated use, which involves a small number of clauses.  You'd
want to
 improve the incremental sorting (e.g., using the bucket
technique of
 BooleanQuery) if you need it for large numbers of clauses.
   
When you're interested, you can also have a look here for
yet another DisjunctionScorer:
http://issues.apache.org/bugzilla/show_bug.cgi?id=31785
   
It has the advantage that it implements skipTo() so that it can
be used as a subscorer of ConjunctionScorer, ie. it can be
faster in situations like this:
   
aa AND (bb OR cc)
   
where bb and cc are treated by the DisjunctionScorer.
When aa is a filter this can also be used to implement
a filtering query.
   
   
   
   
 Re. Paul's suggested steps below, I did not integrate this with
   query
 parser as I didn't need that functionality (since I'm generating
the
 multi-field expansions for which max is a much better scoring
choice
 than sum).

 Chuck

 Included message:

 -Original Message-
 From: Chuck Williams [mailto:[EMAIL PROTECTED]
 Sent: Monday, October 11, 2004 9:55 PM
 To: [EMAIL PROTECTED]
 Subject: Contribution: better multi-field searching

 The files included below (MaxDisjunctionQuery.java and
 MaxDisjunctionScorer.java) provide a new mechanism for searching
   across
 multiple fields.
   
The maximum indeed works well, also when the fields differ a lot
   length.
   
Regards,
Paul
   
   
   
   
   
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail:
[EMAIL PROTECTED]
   
   
   
  
-
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]





RE: modifying existing index

2004-11-23 Thread Chuck Williams
A good way to do this is to add a keyword field with whatever unique id
you have for the document.  Then you can delete the term containing a
unique id to delete the document from the index (look at
IndexReader.delete(Term)).  You can look at the demo class IndexHTML to
see how it does incremental indexing for an example.
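
A hedged sketch of that delete-then-add cycle (Lucene 1.4-era API; "uid", indexPath, analyzer, and updatedDoc are illustrative names, not from the original message):

```java
// Sketch only: replace the document whose unique-id keyword field matches uid.
IndexReader reader = IndexReader.open(indexPath);
reader.delete(new Term("uid", uid));    // mark the old copy deleted, if present
reader.close();                         // close the reader before writing

IndexWriter writer = new IndexWriter(indexPath, analyzer, false);
writer.addDocument(updatedDoc);         // add the modified document
writer.close();                         // deleted copy goes away on optimize
```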

Chuck

   -Original Message-
   From: Santosh [mailto:[EMAIL PROTECTED]
   Sent: Tuesday, November 23, 2004 11:34 PM
   To: Lucene Users List
   Subject: Re: modifying existing index
   
   I have gon through IndexReader , I got method : delete(int
   docNum)   ,
   but from where I will get document number? Is  this predifined? or
we
   have
   to give a number prior  to indexing?
   - Original Message -
   From: Luke Francl [EMAIL PROTECTED]
   To: Lucene Users List [EMAIL PROTECTED]
   Sent: Wednesday, November 24, 2004 1:26 AM
   Subject: Re: modifying existing index
   
   
On Tue, 2004-11-23 at 13:59, Santosh wrote:
 I am using lucene for indexing, when I am creating Index the
   docuemnts
   are added. but when I want to modify the single existing document
and
   reIndex again, it is taking as new document and adding one more
time, so
   that I am getting same document twice in the results.
 To overcome this I am deleting existing Index and again
recreating
   whole
   Index. but is it possibe to index  the modified document again and
   overwrite
   existing document without deleting and recreation. can I do this? If
so
   how?
   
You do not need to recreate the whole index. Just mark the
document as
deleted using the IndexReader and then add it again with the
IndexWriter. Remember to close your IndexReader and IndexWriter
after
doing this.
   
The deleted document will be removed the next time you optimize
your
index.
   
Luke Francl
   
   
   
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail:
[EMAIL PROTECTED]
   
   
   
  
-
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]





RE: fetching similar wordlist as given word

2004-11-23 Thread Chuck Williams
Lucene does support stemming, but that is not what your example requires
(stemming equates roaming, roam, roamed, etc.).  For stemming,
look at PorterStemFilter or better, the Snowball stemmers in the
sandbox.  For your similar word list, I think you are looking for the
class FuzzyTermEnum.  This should give you the terms you need, although
perhaps only those with a common prefix of a specified length.
Otherwise, you could develop your own algorithm to look for similar
terms in the index.
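
A rough sketch of the FuzzyTermEnum approach (Lucene 1.4-era API; the "contents" field name and indexPath are assumptions):

```java
// Sketch only: collect terms in the index similar to "roam".
IndexReader reader = IndexReader.open(indexPath);
FuzzyTermEnum fuzzy = new FuzzyTermEnum(reader, new Term("contents", "roam"));
List similar = new ArrayList();
try {
    do {
        Term t = fuzzy.term();
        if (t != null) similar.add(t.text());  // candidate "do you mean" words
    } while (fuzzy.next());
} finally {
    fuzzy.close();
    reader.close();
}
```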

Chuck

   -Original Message-
   From: Santosh [mailto:[EMAIL PROTECTED]
   Sent: Tuesday, November 23, 2004 11:15 PM
   To: Lucene Users List
   Subject: fetching similar wordlist as given word
   
   can lucene will be able to do stemming?
   If I am searching for roam then I know that it can give result for
   foam using fuzzy query. But my requirement is if I search for
roam
   can I get the similar wordlist as output. so that I can show the end
   user in the column  ---   do you mean foam?
   How can I get similar word list in the given content?
   
   
   





RE: Question about multi-searching [re-post]

2004-11-22 Thread Chuck Williams
If you are going to compare scores across multiple indices, I'd suggest
considering one of the patches here:

http://issues.apache.org/bugzilla/show_bug.cgi?id=31841

Chuck

   -Original Message-
   From: Erik Hatcher [mailto:[EMAIL PROTECTED]
   Sent: Monday, November 22, 2004 6:30 AM
   To: Lucene Users List
   Subject: Re: Question about multi-searching [re-post]
   
   
   On Nov 22, 2004, at 9:18 AM, Cocula Remi wrote:
 (First of all: what is the plural of index in English; indexes or
 indices?)
   
   We used "indexes" in Lucene in Action.  It's a bit ambiguous in
   English, but "indexes" sounds less formal and is acceptable.
   
For that, I parse a new query using QueryParser or
MultiFieldQueryParser.
Then I search my indexes using the MultiSearcher class.
   
Ok, but the problem comes when different analyzer are used for
each
index.
QueryParser requires an analyzer to parse the query but a query
parsed with an analyzer is not suitable for searching into an
index
that uses another analyzer.
   
Does anyone know a trick to cope this problem.
   
   Nothing built into Lucene solves this problem specifically.  You'll
   have to come up with your own MultiSearcher-like facility that can
   apply different queries to different indexes and merge the results
back
   together.  This will be awkward when it comes to scoring though,
since
   each index is using a different query.
   
Eventually I could run a different query on each index to obtain
several Hits objects.
Then I could write some collector that collects Hits in the order
of
highest scores.
I wonder if this could work and if it would be as efficient as
the
MultiSearcher . In this situation does it make sense to compare
the
scores of two differents Hits.
   
   No, it won't make good sense to compare the scores between the
queries,
   but I suspect your queries are pretty close to one another if all
that
   varies is the analyzer.  It still will be an awkward comparison
though,
   but maybe good enough for your needs?
   
   Erik
   
   
  
-
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]





RE: Need help with filtering

2004-11-22 Thread Chuck Williams
It sounds like you need to pad your numbers with leading zeroes, i.e.
use the same type of encoding as is required by RangeQuery's.  If you
query with "05" instead of "5", do you get what you expect?  If all your
document id's are fixed length, then string comparison will be
isomorphic to integer comparison.
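
A small self-contained demonstration of why the padding matters (plain Java, nothing Lucene-specific; the 6-digit width is arbitrary):

```java
public class PadDemo {
    // Zero-pad a non-negative int to a fixed width so that lexicographic
    // (String) order agrees with numeric order.
    public static String pad(int n, int width) {
        StringBuilder sb = new StringBuilder(Integer.toString(n));
        while (sb.length() < width) sb.insert(0, '0');
        return sb.toString();
    }

    public static void main(String[] args) {
        // Unpadded, "5" sorts after "400123", so a term scan starting at "5"
        // never reaches the 400,000-range ids.
        System.out.println("5".compareTo("400123") > 0);              // true
        // Padded to a fixed width, string order matches numeric order.
        System.out.println(pad(5, 6).compareTo(pad(400123, 6)) < 0);  // true
    }
}
```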

Chuck

   -Original Message-
   From: Edwin Tang [mailto:[EMAIL PROTECTED]
   Sent: Monday, November 22, 2004 10:34 AM
   To: Lucene Users List
   Subject: Re: Need help with filtering
   
   Hello again,
   
   I've modified DateFilter to filter out document IDs as suggested.
All
   seems to
   be running well until I tried a specific test case. All my documents
   have IDs
   in the 400,000 range. If I set my lower limit to 5, nothing comes
back.
   After
   examining the code, I found the issue to be at the following line:
   TermEnum enumerator = reader.terms(new Term(field, start));
   
   Is there a way to retrieve a set of documents with IDs using an Integer
   comparison versus a String comparison? If I set start to 0, I get
   everything,
   but that's not very efficient.
   
   Thanks in advance,
   Ed
   
   --- Paul Elschot [EMAIL PROTECTED] wrote:
   
On Wednesday 17 November 2004 01:20, Edwin Tang wrote:
 Hello,

 I have been using DateFilter to limit my search results to a
certain
   date
 range. I am now asked to replace this filter with one where my
   search
results
 have document IDs greater than a given document ID. This
document ID
   is
 assigned during indexing and is a Keyword field.

 I've browsed around the FAQs and archives and see that I can
either
   use
 QueryFilter or BooleanQuery. I've tried both approaches to limit
the
document
 ID range, but am getting the BooleanQuery.TooManyClauses
exception
   in both
 cases. I've also tried bumping max number of clauses via
setMaxClauseCount(),
 but that number has gotten pretty big.

 Is there another approach to this? ...
   
Recoding DateFilter to a DocumentIdFilter should be
straightforward.
   
The trick is to use only one document enumerator at a time for all
terms. Document enumerators take buffer space, and that is the
reason why BooleanQuery has an exception for too many clauses.
   
Regards,
Paul
   
   
   
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail:
[EMAIL PROTECTED]
   
   
   
   
   __
   Do You Yahoo!?
   Tired of spam?  Yahoo! Mail has the best spam protection around
   http://mail.yahoo.com
   
  
-
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]





RE: Lucene - index fields design question

2004-11-16 Thread Chuck Williams
I do most of these same things and made these relevant design decisions:

1.  Use a combination of query expansion to search across multiple
fields and field concatenation to create document fields that combine
separate object fields.  I use multiple fields only when it is important
to weight them differently.  E.g., in my case the separate fields are
combined into just title and body document fields for general term
searching.  I expand queries (with my own expander after parsing) by
rewriting queries against the default field into an OR across title
and body with title boosted higher than body.
2.  One problem with the above concerns scoring (and this is also one of
the reasons to use concatenation rather than query expansion as much as
possible).  Lucene's BooleanQuery uses sum-based scoring for OR's that is
further factored with the coord() adjustment (settable in the
Similarity).  This causes OR's to behave very poorly for the
field-expansion case.  E.g., if the query is "foo bar", and you expand
each term into title and body in the simplest way to produce title:foo^4
body:foo title:bar^4 body:bar, then a document with foo in title and bar
in body will get the same score as one with foo in title and foo in
body, clearly not desired.  There are at least 3 different solutions to
this problem discussed on this list.  I wrote my own MaxDisjunctionQuery
just to handle this case:  it uses max instead of sum for this kind of
OR query, and it does not use coord() (so use MaxDisjunctionQuery for
the OR's of the same term or other query across multiple fields, and
regular BooleanQuery to OR together the different terms or other
queries).  Paul Elschot wrote a more general DisjunctionQuery that can
be configured to do the same thing.  Doug Cutting came up with a
solution that does not require a new Query class; his solution expands
the query in a certain way and specializes certain existing methods.
You should be able to find these solutions by searching the archive
(e.g., search for MaxDisjunctionQuery and DisjunctionQuery and read the
threads).  Code is posted in one way or other.
3.  RangeQuery's are the way to do your date ranges, or any other
ranges.  The encodings need to be lexicographic, not integer.  E.g., "10"
precedes "2", so pad with leading 0's ("02" precedes "10").  If you need
negatives or floats, you need additional considerations to ensure
consistency with lexicographic order (invert the order of negatives and
use a sign representation such that the positive sign indicator follows
the negative sign indicator; floats require nothing special so long as
the integer portion is fixed length).  Dates encode naturally.  I add
additional fields like those used to search Ranges onto the Lucene
documents in addition to title and body.  There are numerous messages on
the list that discuss details of this, and there is a link to the web
site that goes through a complete example, including showing how to
specialize the query parser if you want users entering RangeQuery's in
Lucene syntax (either way you have to lexicographically encode both
queries and the document fields you index).
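
A self-contained sketch of the encoding described in point 3 (plain Java; the scheme and class names are illustrative, not a Lucene API):

```java
public class LexEncode {
    // Encode a signed int into a string whose lexicographic order matches
    // numeric order: '0' prefix for negatives (magnitude complemented so
    // more-negative values sort first), '1' prefix for non-negatives.
    public static String encode(int n, int width) {
        int max = (int) Math.pow(10, width) - 1;   // e.g. 999999 for width 6
        if (n >= 0) return "1" + pad(n, width);
        return "0" + pad(max + n, width);          // n < 0, so this subtracts |n|
    }

    static String pad(int n, int width) {
        StringBuilder sb = new StringBuilder(Integer.toString(n));
        while (sb.length() < width) sb.insert(0, '0');
        return sb.toString();
    }

    public static void main(String[] args) {
        // -2 < -1 < 0 < 10 holds in string order as well:
        System.out.println(encode(-2, 6));  // 0999997
        System.out.println(encode(-1, 6));  // 0999998
        System.out.println(encode(0, 6));   // 1000000
        System.out.println(encode(10, 6));  // 1000010
    }
}
```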

If you have more specific questions or cannot find the references,
please just ask.

Good luck,

Chuck


   -Original Message-
   From: Venkatraju [mailto:[EMAIL PROTECTED]
   Sent: Tuesday, November 16, 2004 9:51 AM
   To: lucene-user
   Subject: Lucene - index fields design question
   
   Hi,
   
   I am a new user of Lucene. so please point me to
   documentation/archives if these issues have been covered before.
   
   I plan to use Lucene in a application with the following (fairly
   standard) requirements:
   - Index documents that contain a title, author, date and content
   - It is fairly common to search for some text across all the fields
   - Matches in the title field should be given more weightage over
   matches in the content field
   - Provide an option to restrict search to documents within a date
range
   
   Give these requirements, what is a good index design with search
speed
   in mind?
   Documents will have fields title, author, date and content.
   Should I make title and author part of the content as well so that
   search across all fields will just become a search in content
   field? If so, how do I give more weightage to matches in title
   field?
   
   The other option would be to expand a simple query to include
searches
   across all fields.
   Ex.: Expand abcd to title:abcd^4 OR content:abcd. Also, should
the
   boost for title field be applied in the query or is it better to
   provide a boost to the title field during indexing (is that
possible)?
   Which of these options will work and be more effecient?
   
   For date range limited search, can field values be integers? If not,
   encoding the date as MMDDHHMM and then use a filter or a
   RangeQuery - is that the way to do this?
   
   Thanks,
   Venkat
   
  

RE: setting Similarity at search time

2004-11-15 Thread Chuck Williams
Take a look at this:

http://issues.apache.org/bugzilla/show_bug.cgi?id=31841

Not my initial patch, but the latest patch from Wolf Siberski.  I
haven't used it yet, but it looks like what you are looking for, and
something I want to use too.

Chuck

   -Original Message-
   From: Ken McCracken [mailto:[EMAIL PROTECTED]
   Sent: Monday, November 15, 2004 11:31 AM
   To: Lucene Users List
   Subject: setting Similarity at search time
   
   Hi,
   
   Is there a way to set the Similarity at search(...) time, rather
than
   just setting it on the (Index)Searcher object itself?  I'd like to
be
   able to specify different similarities in different threads
searching
   concurrently, using the same IndexSearcher instance.
   
   In my use case, the choice of Similarity is a parameter of the
search
   request, and hence may be different for each request.
   
   Can such a method be added to override the search(...) method?
   
   Thanks,
   -Ken
   
  
-
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]





RE: How to efficiently get # of search results, per attribute

2004-11-13 Thread Chuck Williams
My Lucene application includes multi-faceted navigation that does a more
complex version of the below.  I've got 5 different taxonomies into
which every indexed item is classified.  The largest of the taxonomies
has over 15,000 entries while the other 4 are much smaller. For every
search query, I determine the best small set of nodes from each taxonomy
to present to the user as drill down options, and provide the counts
regarding how many results fall under each of these nodes.  At present I
only have about 25,000 indexed objects and usually no more than 1,000
results from the initial query.  To determine the drill-down options and
counts, I scan up to 1,000 results computing the counts for all nodes
into which these results classify.  Then for each taxonomy I pick the
best drill-down options available (orthogonal set with reasonable
branching factor that covers all results) and present them with their
counts.  If there are more than 1,000 results, I extrapolate the
computed counts to estimate the actual counts on the entire set of
results.  This is all done with a single index and a single search.

The total time required for performing this computation for the one
large taxonomy is under 10ms, running in full debug mode in my ide.  The
query response time overall is subjectively instantaneous at the UI
(Google-speed or better).  So, unless some dimension of the problem is
much bigger than mine, I doubt performance will be an issue.
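
The per-query counting step itself is ordinary Java; a minimal self-contained sketch (class and method names are illustrative):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FacetCounts {
    // Post-search counting step: tally how many of the scanned hits fall
    // into each taxonomy node (represented here as a String key).
    public static Map<String, Integer> count(List<String> hitCategories) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String c : hitCategories) {
            Integer n = counts.get(c);
            counts.put(c, n == null ? 1 : n + 1);
        }
        return counts;
    }

    // Scale a count taken over the first 'scanned' hits up to 'total' hits,
    // as in the extrapolation described above.
    public static int extrapolate(int count, int scanned, int total) {
        return scanned == 0 ? 0 : (int) Math.round((double) count * total / scanned);
    }
}
```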

Chuck

   -Original Message-
   From: Nader Henein [mailto:[EMAIL PROTECTED]
   Sent: Saturday, November 13, 2004 2:29 AM
   To: Lucene Users List
   Subject: Re: How to efficiently get # of search results, per
attribute
   
   It depends on how many results they're looking through; here are two
   scenarios I see:
   
   1] If you don't have that many records you can fetch all the results
and
   then do a post parsing step the determine totals
   
   2] If you have a lot of entries in each category and you're worried
   about fetching thousands of records every time, you can just have
   seperate indecies per category and search them in in parallel (not
   Lucene Parallel Search) and you can get up to 100 hits for each one
   (efficiency) but you'll also have the total from the search to
display.
   
   Either way you can boost up speed using RamDirectory if you need
more
   speed from the search, but whichever approach you choose I would
   recommend that you sit down and do some number crunching to figure
out
   which way to go.
   
   
   Hope this helps
   
   Nader Henein
   
   
   
   Chris Lamprecht wrote:
   
   I'd like to implement a search across several types of entities,
   let's say, classes, professors, and departments.  I want the user
to
   be able to enter a simple, single query and not have to specify
what
   they're looking for.  Then I want the search results to be
something
   like this:
   
   Search results for: philosophy boyer
   
   Found: 121 classes - 5 professors - 2 departments
   
   search results here...
   
   
   I know I could iterate through every hit returned and count them up
   myself, but that seems inefficient if there are lots of results.
Is
   there some other way to get this kind of information from the
search
   result set?  My other ideas are: doing a separate search each
result
   type, or storing different types in different indexes.  Any
   suggestions?  Thanks for your help!
   
   -Chris
   
  
-
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail:
[EMAIL PROTECTED]
   
   
   
   
   
   
   
  
-
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]





RE: Anyone implemented custom hit ranking?

2004-11-13 Thread Chuck Williams
I've done some customization of scoring/ranking and plan to do more.  A
good place to start is with your own Similarity, extending Lucene's
DefaultSimilarity.  Like you, I found the default length normalization
to not work well with my dataset.  I separately weight each indexed
field according to a static relative importance (implemented as a query
boost factor that is automatically applied) and then disable length
normalization altogether by redefining lengthNorm() to always return
1.0f.

I also had problems with tf and idf normalization, especially with idf
dominating the ranking determination.  To address that, my Similarity
increases the base of the log for each, and adds a final square root to
the idf computation since Lucene squares the idf in the score
computations.
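
A hedged sketch of that kind of Similarity (Lucene 1.4-era signatures; they may differ in other versions):

```java
// Sketch of the adjustments described above, not a tuned implementation.
public class FlatSimilarity extends DefaultSimilarity {
    // Disable length normalization entirely.
    public float lengthNorm(String fieldName, int numTerms) {
        return 1.0f;
    }
    // Damp idf: Lucene effectively squares idf in the score product,
    // so a square root here cancels one factor.
    public float idf(int docFreq, int numDocs) {
        return (float) Math.sqrt(super.idf(docFreq, numDocs));
    }
}
```

Installed via Similarity.setDefault(new FlatSimilarity()), or per-instance with searcher.setSimilarity(...) and writer.setSimilarity(...).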

Have you tried the explain() mechanism?  It is a great way to see
precisely how your results are being scored (but be warned there is a
final normalization in Hits that explain() does not show -- this final
normalization does not affect the ranking order, but it does affect the
final scores).

Chuck

   -Original Message-
   From: Sanyi [mailto:[EMAIL PROTECTED]
   Sent: Saturday, November 13, 2004 12:38 AM
   To: [EMAIL PROTECTED]
   Subject: Anyone implemented custom hit ranking?
   
   Hi!
   
   I have problems with short text ranking. I've read about the same
   ranking problems in the list archives, but found only hints and
   thoughts (adjust
DefaultSimilarity,
   Similarity, etc...), not
   complete solutions with source code.
   Anyone implemented a good solution for this problem? (example: my
search
   application returns about
   10-20 pages of 1-2 word hits for hello, and then it starts to list
the
   longer texts)
   I've implemented a very simple solution: I boost documents shorter
than
   300 chars with
   1/300*doclength at index time. Now it works a lot better. In fact, I
   can't see any problems now.
   Anyway, I think this is not the solution, this is a patch or
   workaround.
   So, I'd be interested in some kind of well designed complete
solution
   for this problem.
   
   Regards,
   Sanyi
   
   
   
   __
   Do you Yahoo!?
   Check out the new Yahoo! Front Page.
   www.yahoo.com
   
   
   
  
-
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]





RE: lucene Scorers

2004-11-12 Thread Chuck Williams
I had a similar need and wrote MaxDisjunctionQuery and
MaxDisjunctionScorer.  Unfortunately these are not available as a patch
but I've included the original message below that has the code (modulo
line breaks added by simple text email format).

This code is functional -- I use it in my app.  It is optimized for its
stated use, which involves a small number of clauses.  You'd want to
improve the incremental sorting (e.g., using the bucket technique of
BooleanQuery) if you need it for large numbers of clauses.

Re. Paul's suggested steps below, I did not integrate this with query
parser as I didn't need that functionality (since I'm generating the
multi-field expansions for which max is a much better scoring choice
than sum).

Chuck

Included message:

-Original Message-
From: Chuck Williams [mailto:[EMAIL PROTECTED] 
Sent: Monday, October 11, 2004 9:55 PM
To: [EMAIL PROTECTED]
Subject: Contribution: better multi-field searching

The files included below (MaxDisjunctionQuery.java and
MaxDisjunctionScorer.java) provide a new mechanism for searching across
multiple fields.

The issue is this.  Imagine you have two fields, title and document,
both of which you want to search with simple queries like:  albino
elephant.  There are two general approaches, either a) create a combined
field that concatenates the two individual fields, or b) expand the
simple query into a BooleanQuery that searches for each term in both
fields.

With approach a), you lose the flexibility to set separate boost factors
on the individual fields.  I wanted title to be much more important than
description for ranking results, and wanted to control this explicitly,
as length norm was not always doing the right thing; e.g., descriptions
are not always long.

With approach b) you run into another problem.  Suppose the example
query is expanded into (title:albino description:albino title:elephant
description:elephant).  Then, assuming tf/idf doesn't affect ranking, a
document with albino in both title and description will score the same
as a document with albino in title and elephant in description.  The
latter document for most applications is much better since it matches
both query terms.  If albino is the more important term according to
idf, then the less desirable documents (albino in both fields) will rank
consistently ahead of the albino elephants (which is what was happening
to me, yielding horrible results).

MaxDisjunctionQuery solves this problem.  The MaxDisjunctionQuery pretty
prints as:  (q1 | q2 | ... | qn)~tiebreaker

The qi's are any subqueries.  This generates the same results as an
OR-type BooleanQuery but scores them differently.  The score for any
document d is the maximum value of the score that d receives for any
subquery, plus the tiebreaker times the sum of the scores it receives
for any other retrieving subqueries.  In the simplest case, tiebreaker
is 0.0f, and the score is simply the maximum score for any retrieving
subquery.  If tiebreaker is nonzero, it should be much smaller than the
boosts being used (0.1 is working very well for me  with title boost at
4.0 and description boost at 1.0).
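
[Editor's note: the combination rule described above can be sketched in plain Java. This is an illustrative helper, not the actual MaxDisjunctionScorer code; the boost values mirror the title/description example in this message.]

```java
public class MaxDisjunctionScoreDemo {
    // max subquery score, plus tiebreaker times the sum of the other scores
    static float combine(float tiebreaker, float... subScores) {
        float max = 0f, sum = 0f;
        for (float s : subScores) {
            sum += s;
            if (s > max) max = s;
        }
        return max + tiebreaker * (sum - max);
    }

    public static void main(String[] args) {
        // a term matching title (boost 4.0) and description (boost 1.0)
        System.out.println(combine(0.1f, 4.0f, 1.0f)); // 4.1
        // with tiebreaker 0, only the best-scoring field counts
        System.out.println(combine(0.0f, 4.0f, 1.0f)); // 4.0
    }
}
```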

With this mechanism, the albino elephant query is expanded like this:

( (title^4.0:albino | description:albino)~0.1
  (title^4.0:elephant | description:elephant)~0.1
)

I.e., a BooleanQuery is used to cover the distinct terms, while a
MaxDisjunctionQuery is used to expand the fields.

This query has the following properties:
  1.  Documents with two distinct terms score higher than documents with
the same term in the two different fields.
  2.  Documents that contain a title match for a term score higher than
documents containing only a description match for the same term.
  3.  If two documents contain the same query terms, and yet one of them
contains one of the query terms in multiple fields while the other does
not, the document containing the term in multiple fields scores higher
(this is the purpose of the tiebreaker -- it breaks ties among documents
that match the same terms in the same highest-scoring fields).

Sorry if this is redundant, but I didn't find anything in Lucene already
to do this.  It has helped me considerably, so I'd like to submit it in
case others are facing the same issues.

As an aside, is there a reason that idf is squared in each Term and
Phrase match (it is multiplied both into the query component and the
field component)?  To compensate for this, I'm taking the square root of
the idf I really want in my Similarity, which seems strange.
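
[Editor's note: the square-root compensation described above can be checked numerically in plain Java; the idf value 3.0 is an arbitrary example.]

```java
public class IdfSquareDemo {
    public static void main(String[] args) {
        // Lucene multiplies idf into both the query weight and the term
        // weight, so a term's effective idf factor is idf * idf
        double desired = 3.0;
        double effective = desired * desired;
        // returning sqrt(desired) from Similarity.idf() makes the squared
        // product come out to the intended value
        double compensated = Math.sqrt(desired) * Math.sqrt(desired);
        System.out.println(effective);                               // 9.0
        System.out.println(Math.abs(compensated - desired) < 1e-9);  // true
    }
}
```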

Thanks for any info on that and any feedback on the utility of
MaxDisjunctionQuery.

NOTE:  The java files use generics and so require the 1.5 jdk, although
it would be straightforward to back-port them to earlier jdk's.

Chuck Williams

*** MaxDisjunctionQuery.java

/*
 * MaxDisjunctionQuery.java
 *
 * Created on October 9, 2004, 3:17 PM
 */

package org.apache.lucene.search;

import java.io.IOException;

RE: Is there an easy way to have indexing ignore a CVS subdirectory in the index directory?

2004-11-05 Thread Chuck Williams
Thanks Daniel and Justin for the suggestions!  I have a fix and will
record my experience here for the benefit of anybody else facing this
problem:

1.  .cvsignore did not work.  CVS may ignore the Lucene index directory,
but it still insists on creating the CVS subdirectory of the index
directory.

2.  I didn't try the suggestion of defining an alias module with a CVS
directory exclude (!) restriction.  This might have worked had I limited
all my CVS operations to just work with the alias module, but this would
limit flexibility and remove a lot of the nice CVS integration features
in the Netbeans ide.

3.  Bernhard's patch solves the problem!  I had a couple minor glitches
installing it.  First, there is a missing throws IOException
declaration on the list(FileFilter) method he has added.  Second, the
patch is based on a newer version of FSDirectory than the version in
1.4.2, so my attempt to apply the patch automatically failed.  Applying
the patch manually and adding the throws declaration fixed all problems.

I would like to suggest that Bernhard's patch be integrated into the
next version of Lucene.

Chuck

   -Original Message-
   From: Daniel Naber [mailto:[EMAIL PROTECTED]
   Sent: Friday, November 05, 2004 10:00 AM
   To: Lucene Users List
   Subject: Re: Is there an easy way to have indexing ignore a CVS
   subdirectory in the index directory?
   
   On Friday 05 November 2004 18:03, Chuck Williams wrote:
   
The Lucene index is not in CVS -- neither the directory nor the
files.
But it is a subdirectory of a directory that is in CVS,
   
   Does this patch help?
   http://issues.apache.org/bugzilla/show_bug.cgi?id=31747
   
   --
   http://www.danielnaber.de
   
  
-
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Is there an easy way to have indexing ignore a CVS subdirectory in the index directory?

2004-11-04 Thread Chuck Williams
Otis, thanks for looking at this.  The stack trace of the exception is
below.  I looked at the code.  It wants to delete every file in the
index directory, but fails to delete the CVS subdirectory entry
(presumably because it is marked read-only; the specific exception is
swallowed).  Even if it could delete the CVS subdirectory, this would
just cause another problem with Netbeans/CVS, since it wouldn't know how
to fix up the pointers in the parent CVS subdirectory.  Is there a
change I could make that would cause it to safely leave this alone?

This problem only arises on a full index (incremental == false =>
create == true).  Incremental indexes work fine in my app.

Chuck

java.io.IOException: Cannot delete CVS
at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:144)
at org.apache.lucene.store.FSDirectory.init(FSDirectory.java:128)
at
org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:102)
at
org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:83)
at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:173)
at [my app]...

   -Original Message-
   From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
   Sent: Thursday, November 04, 2004 1:54 PM
   To: Lucene Users List
   Subject: Re: Is there an easy way to have indexing ignore a CVS
   subdirectory in the index directory?
   
   Hm, as far as I know, a CVS sub-directory in an index directory
should
   not bother Lucene.  As a matter of fact, I tested this (I used a
file,
   not a directory) for Lucene in Action.  What error are you getting?
   
   I know there is -I CVS option for ignoring files; perhaps it works
with
   directories, too.
   
   Otis
   
   
   --- Chuck Williams [EMAIL PROTECTED] wrote:
   
I have a Tomcat web module being developed with Netbeans 4.0 ide
using
CVS.  One CVS repository holds the sources of my various web files
in
a
directory structure that directly parallels the standard Tomcat
webapp
directory structure.  This is well supported in a fully automated
way
within Netbeans.  I have my search index directory as a
subdirectory
of
WEB-INF, which seemed the natural place to put it.  The index
files
themselves are not in the repository.  I want to be able to do CVS
Update for the web module directory tree as a whole.  However,
this
places a CVS subdirectory within the index directory, which in
turn
causes Lucene indexing to blow up the next time I run it since
this
is
an unexpected entry in the index directory.  To make things works,
to
work around the problem I both need to delete the CVS subdirectory
and
find and delete the pointers to it in the Entries file and
Netbeans
cache file within the CVS subdirectory of the parent directory.
This
is
annoying to say the least.
   
   
   
I've asked the Netbeans users if there is a way to avoid creation
of
the
index's CVS subdirectory, but the same thing happened using WinCVS
and I
so I expect this is not a Netbeans issue.  It could be my relative
ignorance of CVS.
   
   
   
How do others avoid this problem?
   
   
   
Any advice or suggestions would be appreciated.
   
   
   
Thanks,
   
   
   
Chuck
   
   
   
   
   
   
  
-
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Sorting in Lucene.

2004-11-04 Thread Chuck Williams
Yes, by one or multiple criteria.

Chuck

   -Original Message-
   From: Ramon Aseniero [mailto:[EMAIL PROTECTED]
   Sent: Thursday, November 04, 2004 6:21 PM
   To: 'Lucene Users List'
   Subject: Sorting in Lucene.
   
   Hi All,
   
   
   
   Does Lucene supports sorting on the search results?
   
   
   
   Thanks in advance.
   
   Ramon


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Sorting in Lucene.

2004-11-04 Thread Chuck Williams
Ramon,

I'm not sure where a guide or tutorial might be, but you should be able
to see how to do it from the javadoc.  Look at classes Sort, SortField,
SortComparator.  I've also included a recent message from this group
below concerning sorting with multiple fields.  FYI, a number of people
have wanted to first sort by score and secondarily by another field.
This is tricky since scores are frequently different in low-order
decimal positions.

Good luck,

Chuck

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED] 
Sent: Thursday, November 04, 2004 1:33 AM
To: Lucene Users List
Subject: Re: sorting by score and an additional field

On Nov 3, 2004, at 9:52 PM, Chris Fraschetti wrote:
 Has anyone had any luck using lucene's built in sort functions to sort
 first by the lucene hit score and secondarily by a Field in each
 document indexed as Keyword and in integer form?

I get multiple sort fields to work, here's two examples:

 new Sort(new SortField[]{
   new SortField("category"),
   SortField.FIELD_SCORE,
   new SortField("pubmonth", SortField.INT, true)
 });

new Sort(new SortField[] {SortField.FIELD_SCORE, new 
SortField("category")})

Both of these, on a tiny dataset of only 10 documents, works exactly as 
expected.

 I can only get it to sort by one or the other... but when it does one,
 it does sort correctly, but together in {score, custom_field} only the
 first sort seems to apply.

 Any ideas?

Are you using Lucene 1.4.2?  How did you index your integer field?  Are 
you simply using the .toString() of an Integer?  Or zero padding the 
field somehow?  You can use the .toString method, but you have to be 
sure that the sorting code does the right parsing of it - so you might 
need to specify SortField.INT as its type.  It will do automatic 
detection if the type is not specified, but that assumes that the first 
document it encounters parses properly, otherwise it will fall back to 
using a String sort.

Erik



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





   -Original Message-
   From: Ramon Aseniero [mailto:[EMAIL PROTECTED]
   Sent: Thursday, November 04, 2004 9:53 PM
   To: 'Lucene Users List'
   Subject: RE: Sorting in Lucene.
   
   Hi Chuck,
   
   Can you please point me to some articles or FAQ about Sorting in
Lucene?
   
   Thanks a lot for your reply.
   
   Thanks,
   Ramon
   
   -Original Message-
   From: Chuck Williams [mailto:[EMAIL PROTECTED]
   Sent: Thursday, November 04, 2004 9:44 PM
   To: Lucene Users List
   Subject: RE: Sorting in Lucene.
   
   Yes, by one or multiple criteria.
   
   Chuck
   
  -Original Message-
  From: Ramon Aseniero [mailto:[EMAIL PROTECTED]
  Sent: Thursday, November 04, 2004 6:21 PM
  To: 'Lucene Users List'
  Subject: Sorting in Lucene.
 
  Hi All,
 
 
 
  Does Lucene supports sorting on the search results?
 
 
 
  Thanks in advance.
 
  Ramon
   
   
  
-
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]
   
   
   
   
  
-
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Aliasing problem

2004-10-26 Thread Chuck Williams
Looks like you produced a PhraseQuery rather than a BooleanQuery.  You
want

+GAME:(doom3 3 doom)

Chuck

   -Original Message-
   From: Abhay Saswade [mailto:[EMAIL PROTECTED]
   Sent: Tuesday, October 26, 2004 10:22 AM
   To: [EMAIL PROTECTED]
   Subject: Aliasing problem
   
   Hi,
   
   One document in my index contains term 'doom 3' (indexed, tokenized,
   stored)
   How can I match term doom3 with that document?
   
   I tried following but no luck
   I have written alias filter which returns 2 more tokens for doom3 as
3
   and
   doom
   
   I construct query +GAME:doom3
   QueryParser returns +GAME:"doom3 3 doom"
   
   I am using StandardTokenizer
   
   Is my approach is correct? Or am I missing something? Any help
highly
   appreciated.
   
   Thanks in advance,
   Abhay
   
   
   
  
-
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Range Query

2004-10-20 Thread Chuck Williams
Karthik,

It is all spelled out in a Lucene HowTo here:
http://wiki.apache.org/jakarta-lucene/SearchNumericalFields

Have fun with it,

Chuck

   -Original Message-
   From: Karthik N S [mailto:[EMAIL PROTECTED]
   Sent: Wednesday, October 20, 2004 12:15 AM
   To: Lucene Users List; Jonathan Hager
   Subject: RE: Range Query
   
   Hi
   
  Jonathan
   
   
 When searching I also pad the query term ???
   
  When Exactly are u handling this  [ using During Indexing Process
   Also or
   while  Search on Process Only  ]
   
  Can u be Please  be specific.
   
  [  if time permits and possible please can u send me the sample
Code
   for
   the same ]
   
  . :)
   
   
Thx in advance
   
   
   -Original Message-
   From: Jonathan Hager [mailto:[EMAIL PROTECTED]
   Sent: Wednesday, October 20, 2004 3:31 AM
   To: Lucene Users List
   Subject: Re: Range Query
   
   
   That is exactly right.  It is searching the ASCII.  To solve it I
pad
   my price using a method like this:
   
 /**
  * Pads the Price so that all prices are the same number of
characters
   and
  * can be compared lexicographically.
  * @param price
  * @return
  */
 public static String formatPriceAsString(Double price) {
   if (price == null) {
 return null;
   }
   return PRICE_FORMATTER.format(price.doubleValue());
 }
   
   where PRICE_FORMATTER contains enough digits for your largest
number.
   
 private static final DecimalFormat PRICE_FORMATTER = new
   DecimalFormat("000.00");
   
   When searching I also pad the query term.  I looked into hooking
into
   QueryParser, but since the lower/upper prices for my application are
   different inputs, I choose to handle them without hooking into the
   QueryParser.
   
   Jonathan
   
   
   On Tue, 19 Oct 2004 12:35:06 +0530, Karthik N S
   [EMAIL PROTECTED] wrote:
   
Hi
   
Guys
   
Apologies.
   
I  have  a Field Type  Text  'ItemPrice' ,  Using it to Store  
   Price
Factor in numeric  such as  10, 25.25 , 50.00
   
If I am suppose to Find the Range factor  between 2   prices
   
ex -
 Contents:shoes +ItemPrice:[10.00 TO 50.60]
   
I get results  other  then the Range that has been  executed
[This
   may
   be
due to query parsing the Ascii values instead of  numeric values ]
   
Am  I am missing something in the Querry syntax  or Is this the
wrong
   way
   to
construct the Query.
   
Please Somebody Advise me ASAP.  :(
   
Thx in advance
   
  WITH WARM REGARDS
  HAVE A NICE DAY
  [ N.S.KARTHIK]
   
   
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail:
[EMAIL PROTECTED]
   
   
   
  
-
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]
   
   
  
-
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Range Query

2004-10-19 Thread Chuck Williams
Range queries use a lexicographic (dictionary) order.  So, assuming all
your values are positive, you need to ensure that the integer part of
each number has a fixed number of digits (pad with leading 0's).  The
fractional part should be fine, although 1.0 will follow 1.  If you have
negative numbers you need to pad an extra 0 on the left of the
positives, start the negatives with -, and invert the magnitude of the
negatives (so they go in the other order).

Your actual example below should work as is, except that 10 will not be
in the range since 10.00 is strictly after 10.  However, this won't work
without the padding assuming you have any prices with an integer part
of other than exactly two digits (e.g., 10 is before 6, but after 06).
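
[Editor's note: the padding point is easy to demonstrate in plain Java; a sketch, with the three-digit width an arbitrary illustrative choice.]

```java
public class PadDemo {
    // left-pad a non-negative integer to three digits so that
    // lexicographic order agrees with numeric order
    static String pad(int n) {
        return String.format("%03d", n);
    }

    public static void main(String[] args) {
        System.out.println(pad(6));   // 006
        System.out.println(pad(10));  // 010
        // unpadded, "10" sorts before "6" lexicographically
        System.out.println("10".compareTo("6") < 0);        // true
        // padded, the order matches the numbers
        System.out.println(pad(10).compareTo(pad(6)) > 0);  // true
    }
}
```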

Chuck

   -Original Message-
   From: Karthik N S [mailto:[EMAIL PROTECTED]
   Sent: Tuesday, October 19, 2004 12:05 AM
   To: LUCENE
   Subject: Range Query
   
   
   Hi
   
   Guys
   
   Apologies.
   
   
   
   I  have  a Field Type  Text  'ItemPrice' ,  Using it to Store  
Price
   Factor in numeric  such as  10, 25.25 , 50.00
   
   If I am suppose to Find the Range factor  between 2   prices
   
   ex -
Contents:shoes +ItemPrice:[10.00 TO 50.60]
   
   
   I get results  other  then the Range that has been  executed   [This
may
   be
   due to query parsing the Ascii values instead of  numeric values ]
   
   Am  I am missing something in the Querry syntax  or Is this the
wrong
   way to
   construct the Query.
   
   Please Somebody Advise me ASAP.  :(
   
   Thx in advance
   
   
   
   
 WITH WARM REGARDS
 HAVE A NICE DAY
 [ N.S.KARTHIK]
   
   
   
   
  
-
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Index and Search Phrase Documents

2004-10-18 Thread Chuck Williams
You haven't provided enough information for anybody to help.  Have you added indexed 
Fields to your document?  If not, there is nothing to search.  I don't think you are 
looking for a parameter to the IndexWriter constructor.  I expect the advice from 
Aviran is best.  You should read and understand the demo apps.  That's how I got 
started -- the demo apps are quite illuminating about how to index, how to search, how 
to incrementally index, etc.  They work and they show the techniques that you can 
readily adapt to your app.

Also, I've taken the liberty to move this thread to the more appropriate mail list.

Good luck,

Chuck

 -Original Message-
 From: PROYECTA.Fernandez Garcia, Ivan
 [mailto:[EMAIL PROTECTED]
 Sent: Monday, October 18, 2004 8:13 AM
 To: Lucene Developers List
 Subject: RE: Index and Search Phrase Documents
 
 I'm looking for information about this question on this page, but I cannot
 resolve my problem.
 After indexing a document, I search for text and no hits are returned when
 there are two or three that should be returned.
 Why?
 
 -Original Message-
 From: Aviran [mailto:[EMAIL PROTECTED]
 Sent: Monday, October 18, 2004 5:08 PM
 To: 'Lucene Developers List'
 Subject: RE: Index and Search Phrase Documents
 
 
 Lucene comes with demo apps that you can learn from. You can read about it
 here http://jakarta.apache.org/lucene/docs/demo.html
 
 Aviran
 http://aviran.mordos.com
 
 
 -Original Message-
 From: PROYECTA.Fernandez Garcia, Ivan
 [mailto:[EMAIL PROTECTED]
 
 Sent: Monday, October 18, 2004 10:18 AM
 To: [EMAIL PROTECTED]
 Subject: Index and Search Phrase Documents
 
 
 Hi everybody,
 
   I want to index a text document.
   We would like to know which parameter we must use (in the IndexWriter
 constructor) to index a document so that we can search its text afterwards.
   If we want to search phrases, what class must we use to do this?
   We would be grateful if you could send us an example.
   Thanks very much.
 
 
 
  Iván Fernández García
  Proyecta Sistemas de Información
 
 
 
 
 
 ---
 Outgoing mail is certified Virus Free.
 Checked by AVG anti-virus system (http://www.grisoft.com).
 Version: 6.0.773 / Virus Database: 520 - Release Date: 05/10/2004
 
 
 --
 Has decidido el mejor precio.  Has decidido IBERIA.com
 You´ve chosen the best price. You´ve chosen  IBERIA.com
 --
 http://www.iberia.com
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 ---
 Incoming mail is certified Virus Free.
 Checked by AVG anti-virus system (http://www.grisoft.com).
 Version: 6.0.773 / Virus Database: 520 - Release Date: 05/10/2004
 
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: index, reindexing problem

2004-10-17 Thread Chuck Williams
I had this same problem a while back.  It should be resolved if you move
the writer = new IndexWriter(...) until after the reader.close().  I.e.,
complete all the deletions and close the reader before creating the
writer.

Chuck

 -Original Message-
 From: MATL (Mats Lindberg) [mailto:[EMAIL PROTECTED]
 Sent: Sunday, October 17, 2004 5:36 AM
 To: [EMAIL PROTECTED]
 Subject: index, reindexing problem
 
 Hello.
 
 I have a problem when reindexing some documents after an index has
been
 created, i get an error, the error is the following.
 caught a class java.io.IOException
 
 with message: Lock obtain timed out:

[EMAIL PROTECTED]:\DOCUME~1\..lucene-0b877c2d5472a608d6ec3ee6174018de-write
 .lock

mailto:[EMAIL PROTECTED]:\DOCUME~1\..lucene-0b877c2d5472a608d6ec3ee6174018
 de-write.lock
 
 
 This is how i do it.
 1.st make the index (_indexDir is the location of the index)
 writer = new IndexWriter(_indexDir, new StandardAnalyzer(), true);
 
 . do the indexing here
 
 writer.optimize();
 
 writer.close();
 
 this works fine
 
 
 2. this is where i get the error (reindex an existing document)
 writer = new IndexWriter(_indexDir, new StandardAnalyzer(), false);
 Directory directory;
 
 IndexReader reader;
 
 // if the file is in the index already, remove it
 
 directory = FSDirectory.getDirectory(_indexDir, false);
 
 reader = IndexReader.open(directory);
 
 try {
 
 Term term = new Term("deleteid", deleteID.toLowerCase());
 
 if (reader.docFreq(term) >= 1) {
 
 deletedItems = reader.delete(term);// - this is where the error
 occurs, i get the locking error
 
 }
 
 } catch (Exception e) {
 
 System.out.println("caught a " + e.getClass() + "\n with message: " +
 e.getMessage());
 }
 
 finally {
 
 reader.close();
 
 directory.close();
 
 }
 
 continue with reindexing the new document
 
 ..
 
 
 
 I hope anyone can help me with this problem.
 
 
 
 Best regards,
 
 Mats Lindberg
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Filtering Results?

2004-10-14 Thread Chuck Williams
Ahh yes, that is a good article.  I inadvertently missed the need to
invert the magnitude of negative numbers in the recipe below (I don't
have negatives in any of my fields).  Fortunately that is also easy to
do.

FYI, you don't need a custom query parser for range queries.  That's
only required if you expect your users to type in range query syntax (so
that you have to convert their numbers to your formatted
representation).  Rather than expect the user to type in that syntax, I
provide text input fields for the range bounds in range-searchable
fields.  You can then either generate standard range query syntax (using
the string-formatted encoding of numbers) or generate the RangeQuery
objects directly, depending on how you are constructing your queries
(with or without QueryParser).  
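
[Editor's note: the zero-padded price formatting discussed in this thread can be sketched in plain Java, no Lucene dependency; the "000.00" pattern matches the example quoted below, and the pinned US locale is an assumption to keep '.' as the decimal separator.]

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

public class PriceKeyDemo {
    // "000.00" pads the integer part to three digits and fixes two decimals,
    // so the formatted strings compare lexicographically in numeric order
    static final DecimalFormat PRICE_FORMATTER =
        new DecimalFormat("000.00", new DecimalFormatSymbols(Locale.US));

    static String key(double price) {
        return PRICE_FORMATTER.format(price);
    }

    public static void main(String[] args) {
        System.out.println(key(5.0));    // 005.00
        System.out.println(key(25.25));  // 025.25
        System.out.println(key(50.0));   // 050.00
    }
}
```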

Chuck

 -Original Message-
 From: sam s [mailto:[EMAIL PROTECTED]
 Sent: Thursday, October 14, 2004 11:22 AM
 To: [EMAIL PROTECTED]
 Subject: RE: Filtering Results?
 
 Thanks Chuck.
 Meanwhile searching on net and found this link
 http://wiki.apache.org/jakarta-lucene/SearchNumericalFields
 Thanks again
 
 
 From: Chuck Williams [EMAIL PROTECTED]
 Reply-To: Lucene Users List [EMAIL PROTECTED]
 To: Lucene Users List [EMAIL PROTECTED]
 Subject: RE: Filtering Results?
 Date: Thu, 14 Oct 2004 09:55:07 -0700
 
 Sam,
 
 You can pick any encoding such that lexicographic order (alphabetic
 order) is consistent with the numeric order you want.  E.g., if a
single
 field can contain positive or negative integers or floats, then the
 following should work:
 1.  First character of every value represents the sign.  You can't
use +
 and - since + is alphabetically before - (which would make positives
 smaller than negatives), so pick a different character to represent +
 like maybe =.
 2.  Characters 2 through n are a fixed length string that represents
the
 integer part of the number, padded with leading zeroes.
 3.  You don't need padding on the right since longer strings
 alphabetically follow shorter strings.  Just include the decimal
point
 if the number is float, and trail out whatever remaining digits
 naturally print.
 4.  One other subtlety occurs if you need to ensure that 2 and 2.0
are
 equal.  You need to transform one to the other (if you can have both
 integers and floats in a single field -- otherwise this is not an
 issue).  You will lose information about the original type.
 
 I haven't tested the above, but think it should work.
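
[Editor's note: a minimal plain-Java sketch of the integer case of this recipe; the seven-digit fixed width and the exact inversion constant are assumptions for illustration, and floats would need the decimal-part handling described above.]

```java
public class NumberKeyDemo {
    // Encode an integer so lexicographic order of the keys matches numeric
    // order: '=' marks non-negatives (it sorts after '-' in ASCII), and
    // negative magnitudes are inverted so larger negatives sort first.
    static String encode(long n) {
        if (n >= 0) {
            return "=" + String.format("%07d", n);
        }
        return "-" + String.format("%07d", 9999999L + n); // n is negative
    }

    public static void main(String[] args) {
        System.out.println(encode(42));   // =0000042
        System.out.println(encode(-42));  // -9999957
        // -5 > -42, and the keys agree
        System.out.println(encode(-5).compareTo(encode(-42)) > 0);  // true
        // any non-negative key sorts after any negative key
        System.out.println(encode(0).compareTo(encode(-1)) > 0);    // true
    }
}
```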
 
 Chuck
 
   -Original Message-
   From: sam s [mailto:[EMAIL PROTECTED]
   Sent: Thursday, October 14, 2004 6:40 AM
   To: [EMAIL PROTECTED]
   Subject: RE: Filtering Results?
  
   Thanks Chuck.
  
   What is the workaround for filtering (preferably using RangeQuery)
   following?
   1. Float values. Do I have to pad those with zeros on both sides?
   2. Negative numbers (integer as well as floats)
  
   Thanks
  
   From: Chuck Williams [EMAIL PROTECTED]
   Reply-To: Lucene Users List [EMAIL PROTECTED]
   To: Lucene Users List [EMAIL PROTECTED]
   Subject: RE: Filtering Results?
   Date: Wed, 13 Oct 2004 21:49:30 -0700
   
   RangeQuery is a good approach.  Put fields on your documents like
 age.
   The only tricky thing is that the comparisons are all done
    lexicographically rather than numerically.  Lucene has a built-in
   routine to convert dates into a monotonic lexicographic sequence
   (DateField.timeToString).  For positive integer data types like
age,
 it
    is sufficient to store them as fixed length Strings, e.g.:
  5 -- 005
 18 -- 018
   100 -- 100
   
    Then just issue range queries.  E.g.:
   1.  age:[018 TO]
   2.  age:[TO 018]
   3.  age:[005 TO 018]
   
    Those are >= and <= queries.  Use {} instead of [] for > and < queries.
   
   Good luck,
   
   Chuck
   
 -Original Message-
 From: sam s [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, October 13, 2004 12:55 PM
 To: [EMAIL PROTECTED]
 Subject: Filtering Results?

 Hi,
 I want to do filtering on matched results of a query.
 For example
 1. age > 18
 2. age < 18
 3. age > 5 and age < 18
 4. birthdate = [some date]
 What can be the best approach?
 How can it be done with range query?
 Can it be done without range query?

 Also.
 Where can I find information meaning of following classes and
how
 to
   use
 them?
 FilteredQuery
 QueryFilter (I didnt understand much looking at test case of
this)
 CachingWrapperFilter
 etc..

 Thanks in advance


_
 Don't just search. Find. Check out the new MSN Search!
 http://search.msn.com/



 -
 To unsubscribe, e-mail:
[EMAIL PROTECTED]
 For additional commands, e-mail:
 [EMAIL PROTECTED]
   
   
  
-
   To unsubscribe, e-mail:
[EMAIL PROTECTED