Re: Aramorph Analyzer

2004-12-20 Thread Pierrick Brihaye
Hi,
Sorry, I (the aramorph maintainer ;-) was absent from the office...
Daniel Naber wrote:
Analyzers that provide ambiguous terms (i.e. a token with more than one term 
at the same position) don't work in Lucene 1.4.
That is the correct answer. I've filed a bug about this: 
http://issues.apache.org/bugzilla/show_bug.cgi?id=23307

This feature has only 
recently been added to CVS.
... and I thank you very much for this commit.
Note, however, that you may experience some problems with the query 
parser, because Buckwalter's Arabic transliteration uses the standard * 
wildcard character as its representation of dhal.

Note also that aramorph has a mailing list for such questions:
http://lists.nongnu.org/mailman/listinfo/aramorph-users
Cheers,
--
Pierrick Brihaye, informaticien
Service régional de l'Inventaire
DRAC Bretagne
mailto:[EMAIL PROTECTED]
+33 (0)2 99 29 67 78


Re: Optimising A Security Filter

2004-12-20 Thread Paul Elschot
On Sunday 19 December 2004 23:05, Steve Skillcorn wrote:
 Hello All;
 
 I bought the Lucene in Action ebook, which is
 excellent and I can strongly recommend.  One question
 that has arisen from the book though is custom
 filters.
 
 I have the situation where the text of my docs is in
 Lucene, but the permissions are in my RDBMS.  I can
 write a filter (in fact have done so) that loops
 through the documents in the passed IndexReader and
 queries the DB to detect if the user has permission
 for them, setting the relevant BitSet.  My results are
 then paged ( last | next ) to a web page.
 
 Does the IndexReader that is passed to the “bits”
 method of the filter represent the entire index, or
 just the results that match the query?

The IndexReader represents the entire index.

 Is not worrying about filters and simply checking the
 returned Hit List before presenting a sensible
 approach?

That's done by the IndexSearcher.search() methods
that take a filter argument.
 
 I can see the point to filters as presented in the
 Lucene in Action ISBN example, but are they a good
 approach where they could end up laboriously marking
 the entire index as True?

The filter is checked only for the documents that match the query,
which runs over the whole index.

The bit filters generally work well, except when you need
a lot of very sparse filters and memory is a concern.

Regards,
Paul Elschot
 





Relevance percentage

2004-12-20 Thread Gururaja H
How do I find out the percentage of matched terms in each matching document 
using Lucene?
Here is an example of what I am trying to do:
The search query has 5 terms (ibm, risc, tape, drive, manual) and there are 4 
matching documents with the following attributes:
Doc#1: contains terms (ibm, drive)
Doc#2: contains terms (ibm, risc, tape, drive)
Doc#3: contains terms (ibm, risc, tape, drive)
Doc#4: contains terms (ibm, risc, tape, drive, manual).
The percentages displayed would be 100% (Doc#4), 80% (Doc#2), 80% (Doc#3) and 
40% (Doc#1).
 
Any help on how to go about doing this ?
 
Thanks,
Gururaja
 
 



Number of documents

2004-12-20 Thread Daniel Cortes
I have to show my boss whether Lucene is the best option for building a search 
engine for a new portal.
I want to know: how many documents do you have in your index?
And how big is your DB?
The formats the portal has to support are HTML, JSP, TXT, DOC, 
PDF, and PPT.

Another question that I have is:
I'm playing with the example files from the book Lucene in Action, trying 
the handling-types example. The data folder contains 5 files, and the 
created index contains five
documents, but the only one that contains any words in the index is the 
.html file.
Does everybody get the same result?



Re: Optimising A Security Filter

2004-12-20 Thread Erik Hatcher
Paul already replied, but I'll add my thoughts below to the thread 
also...

On Dec 19, 2004, at 5:05 PM, Steve Skillcorn wrote:
I bought the Lucene in Action ebook, which is
excellent and I can strongly recommend.
Thank you
Does the IndexReader that is passed to the “bits”
method of the filter represent the entire index, or
just the results that match the query?
It represents the entire index at the time it was instantiated.  This 
is important to know in case documents are later added to the index.

Is not worrying about filters and simply checking the
returned Hit List before presenting a sensible
approach?
It depends.  Is the performance of checking a relational database for 
the results being shown to the user acceptable?  Is the security risk 
of a new piece of code forgetting to check the results of a search 
worth it?

I can see the point to filters as presented in the
Lucene in Action ISBN example, but are they a good
approach where they could end up laboriously marking
the entire index as True?
Iterating through every document in the index certainly is time 
consuming and not something you should do for every search.  However, 
filters are designed to be long-lived.  Write your filter to simply do 
the logic of checking each document against the database, then wrap 
your filter with the caching wrapper.  Be sure to use the same 
IndexReader for each search.  When the index changes, rebuild the 
filter.
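
A minimal sketch of that recipe against the Lucene 1.4 Filter API; the
PermissionDao interface and the "id" key field are hypothetical stand-ins
for your own RDBMS access and document key:

import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;

interface PermissionDao {               // hypothetical: your RDBMS access
  boolean userMayRead(String userId, String docKey);
}

public class SecurityFilter extends Filter {
  private final PermissionDao dao;
  private final String userId;

  public SecurityFilter(PermissionDao dao, String userId) {
    this.dao = dao;
    this.userId = userId;
  }

  // Sets a bit for each document the user may see; expensive, so wrap
  // it in a CachingWrapperFilter and reuse the same IndexReader.
  public BitSet bits(IndexReader reader) throws IOException {
    BitSet bits = new BitSet(reader.maxDoc());
    for (int i = 0; i < reader.maxDoc(); i++) {
      if (reader.isDeleted(i)) continue;
      String key = reader.document(i).get("id");
      if (dao.userMayRead(userId, key)) bits.set(i);
    }
    return bits;
  }
}

With the caching wrapper, the DB loop then runs once per IndexReader
rather than once per search:

// Filter cached = new CachingWrapperFilter(new SecurityFilter(dao, user));
// Hits hits = searcher.search(query, cached);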

There is no clear best way to do this type of filtering of results, I 
don't believe.  There are details to consider for either of these 
approaches.

All help greatly appreciated.  Thanks to the authors
for Lucene in Action, it's given me the high level
best practices I was needing.
Steve - I really appreciate hearing this.  Putting this work to public 
scrutiny opens the possibilities of opinion.  Your comments hearten me.

Erik


Re: Number of documents

2004-12-20 Thread Erik Hatcher
On Dec 20, 2004, at 4:08 AM, Daniel Cortes wrote:
I have to show my boss whether Lucene is the best option for building a 
search engine for a new portal.
I want to know: how many documents do you have in your index?
And how big is your DB?
I highly recommend you use Luke to examine the index.  It is a great 
tool to have handy.  It shows these statistics and many others.

The formats the portal has to support are HTML, JSP, TXT, 
DOC, PDF, and PPT.
HTML, TXT, DOC, and PDF are all quite straightforward to do.  PPT is 
possible, perhaps POI will do the trick.  JSP depends on how you want 
to analyze it.  If any text in the file should be indexed (including 
JSP directives, taglibs, and HTML) then you can treat it as a text 
file.  If you need to eliminate the tags then you'll need to parse the 
JSP somehow, however I strongly recommend that content not reside in 
JSP pages but rather in a content management system, database, or such.

Another question that I have is:
I'm playing with the example files from the book Lucene in Action, trying 
the handling-types example. The data folder contains 5 files, 
and the created index contains five
documents, but the only one that contains any words in the index is the 
.html file.
Does everybody get the same result?
Perhaps you are taking the output you see from ant 
ExtensionFileHandler as an indication of what words were indexed.  
This output, however, is showing Document.toString() which only shows 
the text in stored fields.  This particular example does not actually 
index the documents - it shows the generalized handling framework and 
the parsing of the files into a Lucene Document.  Most of the file 
handlers use unstored fields.  The output I get is shown below.  The 
handlers have successfully extracted the text from the files.  Maybe 
you're referring to the FileIndexer example?  We did not expose this 
one to the Ant launcher.  If FileIndexer is the code you're trying, let 
me know what you've tried and how you're looking for the words that you 
expect to see.  Again, most of the fields are unstored (meaning the 
original content is not stored in the index, only the terms extracted 
through analysis).

Erik
# to make the output cleaner for e-mailing I set ANT_ARGS like this:
% echo $ANT_ARGS
-logger org.apache.tools.ant.NoBannerLogger -emacs -Dnopause=true
% ant ExtensionFileHandler 
-Dfile=src/lia/handlingtypes/data/addressbook-entry.xml
Buildfile: build.xml

ExtensionFileHandler:
  This example demonstrates the file extension document handler.
  Documents with extensions .xml, .rtf, .doc, .pdf, .html, and .txt 
are
  all handled by the framework.  The contents of the Lucene Document
  built for the specified file is displayed.

skipping input as property nopause has already been set.
skipping input as property file has already been set.
Running lia.handlingtypes.framework.ExtensionFileHandler...
log4j:WARN No appenders could be found for logger 
(org.apache.commons.digester.Digester.sax).
log4j:WARN Please initialize the log4j system properly.
Document<Keyword<type:individual> Keyword<name:Zane Pasolini> 
Keyword<address:999 W. Prince St.> Keyword<city:New York> 
Keyword<province:NY> Keyword<postalcode:10013> Keyword<country:USA> 
Keyword<telephone:+1 212 345 6789>>

% ant ExtensionFileHandler -Dfile=src/lia/handlingtypes/data/HTML.html
Buildfile: build.xml
ExtensionFileHandler:
  This example demonstrates the file extension document handler.
  Documents with extensions .xml, .rtf, .doc, .pdf, .html, and .txt 
are
  all handled by the framework.  The contents of the Lucene Document
  built for the specified file is displayed.

skipping input as property nopause has already been set.
skipping input as property file has already been set.
Running lia.handlingtypes.framework.ExtensionFileHandler...
Document<Text<title:Laptop power supplies are available in First Class 
only> Text<body:Code, Write, Fly This chapter is being written 11,000 
meters above New Foundland.>>

% ant ExtensionFileHandler 
-Dfile=src/lia/handlingtypes/data/PlainText.txt
Buildfile: build.xml

ExtensionFileHandler:
  This example demonstrates the file extension document handler.
  Documents with extensions .xml, .rtf, .doc, .pdf, .html, and .txt 
are
  all handled by the framework.  The contents of the Lucene Document
  built for the specified file is displayed.

skipping input as property nopause has already been set.
skipping input as property file has already been set.
Running lia.handlingtypes.framework.ExtensionFileHandler...
Document<UnStored<body>>
% ant ExtensionFileHandler -Dfile=src/lia/handlingtypes/data/PDF.pdf
Buildfile: build.xml
ExtensionFileHandler:
  This example demonstrates the file extension document handler.
  Documents with extensions .xml, .rtf, .doc, .pdf, .html, and .txt 
are
  all handled by the framework.  The contents of the Lucene Document
  built for the specified file is displayed.

skipping input as property nopause 

Re: Relevance percentage

2004-12-20 Thread Mike Snare
I'm still new to Lucene, but wouldn't that be the coord()?  My
understanding is that the coord() is the fraction of the boolean query
that matched a given document.

Again, I'm new, so somebody else will have to confirm or deny...

-Mike


On Mon, 20 Dec 2004 00:33:21 -0800 (PST), Gururaja H
[EMAIL PROTECTED] wrote:
 How to find out the percentages of matched terms in the document(s) using 
 Lucene ?
 Here is an example, of what i am trying to do:
 The search query has 5 terms (ibm, risc, tape, drive, manual) and there are 4 
 matching
 documents with the following attributes:
 Doc#1: contains terms(ibm,drive)
 Doc#2: contains terms(ibm,risc, tape, drive)
 Doc#3: contains terms(ibm,risc, tape,drive)
 Doc#4: contains terms(ibm, risc, tape, drive, manual).
 The percentages displayed would be 100%(Doc#4), 80%(doc#2), 80%(doc#3) and 40%
 (doc#1).
 
 Any help on how to go about doing this ?
 
 Thanks,
 Gururaja
 
 



Re: Relevance percentage

2004-12-20 Thread Gururaja H
Hi,
 
But how do I calculate the coord() fraction?  I know that by default,
in DefaultSimilarity the coord() fraction is defined as below:

/** Implemented as <code>overlap / maxOverlap</code>. */

public float coord(int overlap, int maxOverlap) {

return overlap / (float)maxOverlap;

}
How do I get the overlap and maxOverlap values for each of the matched documents?
 
Thanks,
Gururaja

Mike Snare [EMAIL PROTECTED] wrote:
I'm still new to Lucene, but wouldn't that be the coord()? My
understanding is that the coord() is the fraction of the boolean query
that matched a given document.

Again, I'm new, so somebody else will have to confirm or deny...

-Mike


On Mon, 20 Dec 2004 00:33:21 -0800 (PST), Gururaja H
wrote:
 How to find out the percentages of matched terms in the document(s) using 
 Lucene ?
 Here is an example, of what i am trying to do:
 The search query has 5 terms (ibm, risc, tape, drive, manual) and there are 4 
 matching
 documents with the following attributes:
 Doc#1: contains terms(ibm,drive)
 Doc#2: contains terms(ibm,risc, tape, drive)
 Doc#3: contains terms(ibm,risc, tape,drive)
 Doc#4: contains terms(ibm, risc, tape, drive, manual).
 The percentages displayed would be 100%(Doc#4), 80%(doc#2), 80%(doc#3) and 40%
 (doc#1).
 
 Any help on how to go about doing this ?
 
 Thanks,
 Gururaja
 
 






analyzer affecting phrases?

2004-12-20 Thread Peter Posselt Vestergaard
Hi
I am building an index of texts, each related to a unique id. The unique ids
may contain a number of underscores, which make the StandardAnalyzer
truncate them after it sees the second underscore in a row. Furthermore, many
of the texts I am indexing are in Italian, so the removal of 'trivial' words
done by the standard analyzer is not necessarily meaningful for these texts.
Therefore I am instead using an analyzer made from the WhitespaceTokenizer
and the LowerCaseFilter.
This works fine for me until I try searching for a phrase. I am searching
for a simple phrase containing two words, with double-quotes around it. I
have found the phrase in one of the texts, so I know it should return at
least one result, but none is found. If I remove the double-quotes and
search for the 2 words with AND between them, I do find the story.
Can anyone tell me if this is an obvious (side-)effect of not using the
standard analyzer? And is there a better solution to my problem than using
the very simple analyzer?
Best regards
Peter Vestergaard
PS: I use the same analyzer for both searching and indexing (of course).
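
A minimal sketch of the analyzer described above, assuming the Lucene 1.4
analysis API:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

// Splits on whitespace and lowercases; nothing else, so punctuation
// stays glued to tokens ("world." is not the same term as "world").
class LowercaseWhitespaceAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new LowerCaseFilter(new WhitespaceTokenizer(reader));
  }
}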




determination of matching hits

2004-12-20 Thread Christiaan Fluit
Hello all,
I have a question regarding the determination of the set of matching 
documents, in particular (I guess) related to the NOT operator.

In my case I have a document containing the terms A and B. When I query 
for either A or for B, I get this document back, just as expected. Now 
when I query for A -B, I once again get this document back. In other 
words: this document matches both B and a query containing the clause 
-B, which theoretically should never happen.

I've seen this happen with various keywords, sometimes with multiple 
conflicting documents. In each case, the B query returned the document 
with a very low relevance (e.g. 0.007...).

Based on these low relevancies and a quick peek in the Lucene code, I 
strongly suspect that this is caused by rounding errors, as it seems to 
me that floating point numbers are used to both express the membership 
of a set as well as its score. Can somebody confirm this?

And if this is the case, is there a workaround to eliminate or at least 
significantly suppress this problem? A colleague mentioned boosting 
every term in a query, would this solve anything?

For most search engine-like applications, which order documents on 
relevance, I think this problem is not a real issue since such 
conflicting documents appear at the end of the result list and are not 
likely to be seen by the user. However, in our case we have an 
application which displays overlaps of entire result sets and these 
documents show up very prominently (I can show screenshots if desired). 
We have already been asked by customers to explain these results :)

FYI, in case it may be relevant: I'm still using Lucene 1.4.2. Every 
document has the same set of five fields. The above queries are parsed 
by MultiFieldQueryParser, using all five fields. I haven't touched the 
default operator, but the queries A AND -B and A AND NOT B give the same 
conflicting overlap in the result set.

Thanks in advance,
Christiaan Fluit
Aduna.biz
--


Queries difference

2004-12-20 Thread Alex Kiselevski

Hello, I want to know whether there is a difference between these queries:

+city(+London Amsterdam) +address(1_street 2_street)

And

+city(+London) +city(Amsterdam) +address(1_street)  +address(2_street)

Thanks in advance

Alex Kiselevsky
 Speech Technology   Tel: 972-9-776-43-46
R&D, Amdocs - Israel   Mobile: 972-53-63 50 38
mailto:[EMAIL PROTECTED]





Re: Queries difference

2004-12-20 Thread Morus Walter
Alex Kiselevski writes:
 
 Hello, I want to know is there a difference between queries:
 
 +city(+London Amsterdam) +address(1_street 2_street)
 
 And
 
 +city(+London) +city(Amsterdam) +address(1_street)  +address(2_street)
 
I guess you mean city:(... and so on.

The first query searches documents containing 'London' in city, scoring
results also containing Amsterdam higher, and containing 1_street or 2_street
in address.
The second query searches for documents containing both London and Amsterdam
in city and 1_street and 2_street in address.
Note that the + before London in the second query doesn't mean anything.
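
One way to see the difference is to print the parsed queries (a sketch;
this assumes the corrected city:(...) syntax and the static
QueryParser.parse() of Lucene 1.4):

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class QueryDiff {
  public static void main(String[] args) throws Exception {
    WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer();
    Query q1 = QueryParser.parse(
        "+city:(+London Amsterdam) +address:(1_street 2_street)",
        "city", analyzer);
    Query q2 = QueryParser.parse(
        "+city:London +city:Amsterdam +address:1_street +address:2_street",
        "city", analyzer);
    System.out.println(q1.toString("city"));  // Amsterdam stays optional
    System.out.println(q2.toString("city"));  // every term is required
  }
}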

HTH
Morus




Re: analyzer affecting phrases?

2004-12-20 Thread Otis Gospodnetic
When searching for phrases, what's important is the position of each
token/word extracted by the Analyzer. 
WhitespaceAnalyzer/LowerCaseFilter don't do anything with the
positional information.  Is there anything else in your Analyzer?

In any case, the following should help you see what your Analyzer is
doing:
http://wiki.apache.org/jakarta-lucene/AnalysisParalysis and you can
augment the code there to provide positional information, too.

Otis

--- Peter Posselt Vestergaard [EMAIL PROTECTED] wrote:

 Hi
 I am building an index of texts, each related to a unique id. The
 unique ids
 might contain a number of underscores which will make the
 standardanalyzer
 shorten them after it sees the second underscore in a row.
 Furthermore many
 of the texts I am indexing are in Italian so the removal of 'trivial'
 words
 done by the standard analyzer is not necessarily meaningful for these
 texts.
 Therefore I am instead using an analyzer made from the
 WhitespaceTokenizer
 and the LowerCaseFilter.
 This works fine for me until I try searching for a phrase. I am
 searching
 for a simple phrase containing two words and with double-quotes
 around it. I
 have found the phrase in one of the texts so I know it should return
 at
 least one result, but none is found. If I remove the double-quotes
 and
 search for the 2 words with AND between them I do find the story.
 Can anyone tell me if this is an obvious (side-)effect of not using
 the
 standard analyzer? And is there a better solution to my problem than
 using
 the very simple analyzer?
 Best regards
 Peter Vestergaard
 PS: I use the same analyzer for both searching and indexing (of
 course).
 
 
 





RE: Queries difference

2004-12-20 Thread Alex Kiselevski

Thanks Morus.
So if I understand right, if the second query is:
+city(London) +city(Amsterdam) +address(1_street)  +address(2_street)

then both queries are equivalent?
-Original Message-
From: Morus Walter [mailto:[EMAIL PROTECTED]
Sent: Monday, December 20, 2004 6:11 PM
To: Lucene Users List
Subject: Re: Queries difference


Alex Kiselevski writes:

 Hello, I want to know is there a difference between queries:

 +city(+London Amsterdam) +address(1_street 2_street)

 And

 +city(+London) +city(Amsterdam) +address(1_street)  +address(2_street)

I guess you mean city:(... and so on.

The first query searches documents containing 'London' in city, scoring
results also containing Amsterdam higher, and containing 1_street or
2_street in address. The second query searches for documents containing
both London and Amsterdam in city and 1_street and 2_street in address.
Note that the + before London in the second query doesn't mean anything.

HTH
Morus







RE: Queries difference

2004-12-20 Thread Otis Gospodnetic
Alex, I think you want this:

+city:London +city:Amsterdam +address:1_street +address:2_street

Otis


--- Alex Kiselevski [EMAIL PROTECTED] wrote:

 
 Thanks Morus
 So if I understand right
 If the second query is:
 +city(London) +city(Amsterdam) +address(1_street)  +address(2_street)
 
 Both queries have the same value ?
 -Original Message-
 From: Morus Walter [mailto:[EMAIL PROTECTED]
 Sent: Monday, December 20, 2004 6:11 PM
 To: Lucene Users List
 Subject: Re: Queries difference
 
 
 Alex Kiselevski writes:
 
  Hello, I want to know is there a difference between queries:
 
  +city(+London Amsterdam) +address(1_street 2_street)
 
  And
 
  +city(+London) +city(Amsterdam) +address(1_street) 
 +address(2_street)
 
 I guess you mean city:(... and so on.
 
 The first query searches documents containing 'London' in city,
 scoring
 results also containing Amsterdam higher, and containing 1_street or
 2_street in address. The second query searches for documents
 containing
 both London and Amsterdam in city and 1_street and 2_street in
 address.
 Note that the + before London in the second query doesn't mean
 anything.
 
 HTH
   Morus
 
 
 
 
 
 





Re: determination of matching hits

2004-12-20 Thread Erik Hatcher
Christiaan,
Please simplify your situation.  Use a plain TermQuery for B and see 
what is returned.  Then use a simple BooleanQuery for A -B.  I 
suspect MultiFieldQueryParser is the culprit.  What does the toString 
of the generated Query return?  MFQP is known to be trouble, and an 
overhaul to it has been contributed recently.

Erik
On Dec 20, 2004, at 10:32 AM, Christiaan Fluit wrote:
Hello all,
I have a question regarding the determination of the set of matching 
documents, in particular (I guess) related to the NOT operator.

In my case I have a document containing the terms A and B. When I 
query for either A or for B, I get this document back, just as 
expected. Now when I query for A -B, I once again get this document 
back. In other words: this document matches both B and a query 
containing the clause -B, which theoretically should never happen.

I've seen this happen with various keywords, sometimes with multiple 
conflicting documents. In each case, the B query returned the 
document with a very low relevance (e.g. 0.007...).

Based on these low relevancies and a quick peek in the Lucene code, I 
strongly suspect that this is caused by rounding errors, as it seems 
to me that floating point numbers are used to both express the 
membership of a set as well as its score. Can somebody confirm this?

And if this is the case, is there a workaround to eliminate or at 
least significantly suppress this problem? A colleague mentioned 
boosting every term in a query, would this solve anything?

For most search engine-like applications, which order documents on 
relevance, I think this problem is not a real issue since such 
conflicting documents appear at the end of the result list and are not 
likely to be seen by the user. However, in our case we have an 
application which displays overlaps of entire result sets and these 
documents show up very prominently (I can show screenshots if 
desired). We have already been asked by customers to explain these 
results :)

FYI, in case it may be relevant: I'm still using Lucene 1.4.2. Every 
document has the same set of five fields. The above queries are parsed 
by MultiFieldQueryParser, using all five fields. I haven't touched the 
default operator, but the queries A AND -B and A AND NOT B give the 
same conflicting overlap in the result set.

Thanks in advance,
Christiaan Fluit
Aduna.biz
--


sorting on a field that can have null values

2004-12-20 Thread Praveen Peddi
Hi all,
I am getting a NullPointerException when I am sorting on a field that has a null 
value for some documents. ORDER BY in SQL does work on such fields, and I 
think it puts all results with null values at the end of the list. Shouldn't 
Lucene also do the same thing instead of throwing a NullPointerException? Is 
this expected behaviour? Does Lucene always expect some value in the 
sortable fields?

I thought of putting empty strings instead of null values, but I think empty 
strings are put first in the list while sorting, which is the reverse of what 
anyone would want. 
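
A common workaround (a sketch, not an official fix) is to index a sentinel
that sorts after every real value, so documents with a null land at the end:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class NullSafeFields {
  // "\uFFFF" sorts after any ordinary string in an ascending sort, so
  // null values end up last; strip the sentinel again before display.
  public static void addSortable(Document doc, String name, String value) {
    doc.add(Field.Keyword(name, value != null ? value : "\uffff"));
  }
}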

Following is the exception I saw in the error log:

java.lang.NullPointerException
 at 
org.apache.lucene.search.SortComparator$1.compare(Lorg.apache.lucene.search.ScoreDoc;Lorg.apache.lucene.search.ScoreDoc;)I(SortComparator.java:36)
 at 
org.apache.lucene.search.FieldSortedHitQueue.lessThan(Ljava.lang.Object;Ljava.lang.Object;)Z(FieldSortedHitQueue.java:95)
 at org.apache.lucene.util.PriorityQueue.upHeap()V(PriorityQueue.java:120)
 at 
org.apache.lucene.util.PriorityQueue.put(Ljava.lang.Object;)V(PriorityQueue.java:47)
 at 
org.apache.lucene.util.PriorityQueue.insert(Ljava.lang.Object;)Z(PriorityQueue.java:58)
 at 
org.apache.lucene.search.IndexSearcher$2.collect(IF)V(IndexSearcher.java:130)
 at 
org.apache.lucene.search.Scorer.score(Lorg.apache.lucene.search.HitCollector;)V(Scorer.java:38)
 at 
org.apache.lucene.search.IndexSearcher.search(Lorg.apache.lucene.search.Query;Lorg.apache.lucene.search.Filter;ILorg.apache.lucene.search.Sort;)Lorg.apache.lucene.search.TopFieldDocs;(IndexSearcher.java:125)
 at org.apache.lucene.search.Hits.getMoreDocs(I)V(Hits.java:64)
 at 
org.apache.lucene.search.Hits.init(Lorg.apache.lucene.search.Searcher;Lorg.apache.lucene.search.Query;Lorg.apache.lucene.search.Filter;Lorg.apache.lucene.search.Sort;)V(Hits.java:51)
 at 
org.apache.lucene.search.Searcher.search(Lorg.apache.lucene.search.Query;Lorg.apache.lucene.search.Sort;)Lorg.apache.lucene.search.Hits;(Searcher.java:41)

If it's a bug in Lucene, will it be fixed in the next release? Any suggestions 
would be appreciated.

Praveen

** 
Praveen Peddi
Sr Software Engg, Context Media, Inc. 
email:[EMAIL PROTECTED] 
Tel:  401.854.3475 
Fax:  401.861.3596 
web: http://www.contextmedia.com 
** 
Context Media- The Leader in Enterprise Content Integration 





RE: Relevance percentage

2004-12-20 Thread Chuck Williams
The coord() value is not saved anywhere so you would need to recompute
it.  You could either call explain() and parse the result string, or
better, look at explain() and implement what it does more efficiently
just for coord().  If your queries are all BooleanQuery's of
TermQuery's, then this is very simple.  Iterate down the list of
BooleanClause's and count the number whose score is > 0, then divide
this by the total number of clauses.  Take a look at
BooleanQuery.BooleanWeight.explain() as it does this (along with
generating the rest of the explanation).  If you support the full Lucene
query language, then you need to look at all the query types and decide
what exactly you want to compute (as coord is not always well-defined).
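
For the all-TermQuery case over a single field, a sketch of recomputing the
overlap directly from the index (CoordPercent and terms[] are illustrative
names, not part of Lucene):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

public class CoordPercent {
  // Counts how many query terms occur in the given document and
  // returns overlap / maxOverlap as a fraction between 0.0 and 1.0.
  public static float percentMatched(IndexReader reader, String field,
                                     String[] terms, int docId)
      throws IOException {
    int overlap = 0;
    for (int i = 0; i < terms.length; i++) {
      TermDocs td = reader.termDocs(new Term(field, terms[i]));
      try {
        if (td.skipTo(docId) && td.doc() == docId) overlap++;
      } finally {
        td.close();
      }
    }
    return overlap / (float) terms.length;
  }
}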

I'm on the West Coast of the U.S. so evidently on a very different time
zone from you -- will look at your other message next.

Chuck

   -Original Message-
   From: Gururaja H [mailto:[EMAIL PROTECTED]
   Sent: Monday, December 20, 2004 6:10 AM
   To: Lucene Users List; Mike Snare
   Subject: Re: Relevance percentage
   
   Hi,
   
   But, How to calculate the coord() fraction ?  I know by default,
   in DefaultSimilarity the coord() fraction is defined as below:
   
   /** Implemented as <code>overlap / maxOverlap</code>. */
   
   public float coord(int overlap, int maxOverlap) {
   
   return overlap / (float)maxOverlap;
   
   }
   How to get the overlap and maxOverlap value in each of the matched
   document(s) ?
   
   Thanks,
   Gururaja
   
   Mike Snare [EMAIL PROTECTED] wrote:
   I'm still new to Lucene, but wouldn't that be the coord()? My
   understanding is that the coord() is the fraction of the boolean
query
   that matched a given document.
   
   Again, I'm new, so somebody else will have to confirm or deny...
   
   -Mike
   
   
   On Mon, 20 Dec 2004 00:33:21 -0800 (PST), Gururaja H
   wrote:
How to find out the percentages of matched terms in the
document(s)
   using Lucene ?
Here is an example, of what i am trying to do:
The search query has 5 terms (ibm, risc, tape, drive, manual) and
there
   are 4 matching
documents with the following attributes:
Doc#1: contains terms(ibm,drive)
Doc#2: contains terms(ibm,risc, tape, drive)
Doc#3: contains terms(ibm,risc, tape,drive)
Doc#4: contains terms(ibm, risc, tape, drive, manual).
The percentages displayed would be 100%(Doc#4), 80%(doc#2),
80%(doc#3)
   and 40%
(doc#1).
   
Any help on how to go about doing this ?
   
Thanks,
Gururaja
   
   
   
   
  
   
   
   




RE: analyzer affecting phrases?

2004-12-20 Thread Peter Posselt Vestergaard
Hi again
Thanks for your answer, Otis. My analyzer did not do anything else than
the WhitespaceTokenizer/LowerCaseFilter.
However, I found out that I get problems with characters such as ",.:" when
searching, because of my simple analyzer. (E.g. I would not be able to search
for "world" in the string "Hello world." as "." became part of the last word).

Therefore I turned back to the standard analyzer and now do some replacing
of the underscores in my ID string to avoid my original problem. This solved
my phrase problem, so that I can now search for phrases. However, I still have
the problem with ",.:" described above. As far as I can see, the
StandardAnalyzer (the StandardTokenizer, that is) should tokenize words
without the ",.:" characters. Am I mistaken? Is there a tokenizer that will
do this?
Thanks for the help!
Regards
Peter

 Date: Mon, 20 Dec 2004 08:19:42 -0800 (PST)
 From: Otis Gospodnetic [EMAIL PROTECTED]
 Subject: analyzer affecting phrases?
 Content-Type: text/plain; charset=us-ascii
 
 
 When searching for phrases, what's important is the position of each
 token/word extracted by the Analyzer. 
 WhitespaceAnalyzer/LowerCaseFilter don't do anything with the
 positional information.  There is nothing else in your Analyzer?
 
 In any case, the following should help you see what your Analyzer is
 doing:
 http://wiki.apache.org/jakarta-lucene/AnalysisParalysis and you can
 augment the code there to provide positional information, too.
 
 Otis
 
 -Original Message-
 From: Peter Posselt Vestergaard [mailto:[EMAIL PROTECTED] 
 Sent: 20. december 2004 15:24
 To: '[EMAIL PROTECTED]'
 Subject: analyzer affecting phrases?
 
 Hi
 I am building an index of texts, each related to a unique id. 
 The unique ids might contain a number of underscores which 
 will make the standardanalyzer shorten them after it sees the 
 second underscore in a row. Furthermore many of the texts I 
 am indexing are in Italian so the removal of 'trivial' words 
 done by the standard analyzer is not necessarily meaningful 
 for these texts. Therefore I am instead using an analyzer 
 made from the WhitespaceTokenizer and the LowerCaseFilter.
 This works fine for me until I try searching for a phrase. I 
 am searching for a simple phrase containing two words and 
 with double-quotes around it. I have found the phrase in one 
 of the texts so I know it should return at least one result, 
 but none is found. If I remove the double-quotes and search 
 for the 2 words with AND between them I do find the story.
 Can anyone tell me if this is an obvious (side-)effect of not 
 using the standard analyzer? And is there a better solution 
 to my problem than using the very simple analyzer?
 Best regards
 Peter Vestergaard
 PS: I use the same analyzer for both searching and indexing 
 (of course).




RE: Relevance and ranking ...

2004-12-20 Thread Chuck Williams
I believe your sole problem is that you need to tone down your
lengthNorm.  Because doc4 is 10 times longer than doc2, its lengthNorm
is less than 1/3 of that of doc2 (1/sqrt(10) to be precise).  This is a
larger effect than the higher coord factor (1/.8) and the extra matching
term in doc4.

In your original description, it sounds like you want coord() to
dominate lengthNorm(), with lengthNorm() just being used as a
tie-breaker among queries with the same coord().

To achieve this, you need to reduce the impact of the lengthNorm()
differences, by changing the sqrt() function in the computation of
lengthNorm to something much flatter.  E.g., you might use:

  public float lengthNorm(String fieldName, int numTerms) {
return (float)(1.0 / Math.log10(1000+numTerms));
  }

I'm not sure whether that specific formula will work, but you can find
one that will by adjusting the base of the logarithm and the additive
constant (1000 in the example).

Some general things:
  1.  You need to reindex when you change the Similarity (it is used for
indexing and searching -- e.g., the lengthNorm's are computed at index
time).
  2.  Be careful not to overtune your scoring for just one example.  Try
many examples.  You won't be able to get it perfect -- the idea is to
get close to your subjective judgments as frequently as possible.
  3.  The idea here is to find a value of lengthNorm() that doesn't
override coord, but still provides the tie-breaking you are looking for
(doc2 ahead of doc3).
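
A sketch of wiring the flatter lengthNorm in (note that Math.log10 only
arrived in Java 5; on older JDKs divide Math.log(x) by Math.log(10)).
Per point 1, the Similarity must be set on the IndexWriter before
reindexing as well as on the Searcher:

import org.apache.lucene.search.DefaultSimilarity;

public class FlatLengthNormSimilarity extends DefaultSimilarity {
  // Much flatter than the default 1/sqrt(numTerms), so coord()
  // differences dominate and document length only breaks ties.
  public float lengthNorm(String fieldName, int numTerms) {
    return (float) (1.0 / (Math.log(1000 + numTerms) / Math.log(10)));
  }
}

// writer.setSimilarity(new FlatLengthNormSimilarity());   // then reindex
// searcher.setSimilarity(new FlatLengthNormSimilarity());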

Chuck

   -Original Message-
   From: Gururaja H [mailto:[EMAIL PROTECTED]
   Sent: Sunday, December 19, 2004 10:10 PM
   To: Lucene Users List
   Subject: RE: Relevance and ranking ...
   
   Chuck Williams,
   
   Thanks for the reply. Source code and Output are below.
   
   Please give me your inputs.
   
   Default document order i am getting is: Doc#2, Doc#4, Doc#3, Doc#1.
   Document order needed is: Doc#4, Doc#2, Doc#3, Doc#1.
   
   Let me know, if you need more information.
   
   NOTE: Using Luene Query object not BooleanQuery.
   
   Here is the source code:
   
Searcher searcher = new IndexSearcher(index);

Analyzer analyzer = new StandardAnalyzer();
BufferedReader in = new BufferedReader(new
    InputStreamReader(System.in));
System.out.print("Query: ");
String line = in.readLine();
Query query = QueryParser.parse(line, "contents", analyzer);
System.out.println("Searching for: " + query.toString("contents"));
Hits hits = searcher.search(query);
System.out.println(hits.length() + " total matching documents");
for (int i = start; i < hits.length(); i++) {
  Document doc = hits.doc(i);
  System.out.print("Score is: " + hits.score(i));
  // Use whatever your fields are here:
  System.out.print("  title:");
  System.out.print(doc.get("title"));
  System.out.print(" description:");
  System.out.println(doc.get("description"));
  // End of fields
  System.out.println(searcher.explain(query, hits.id(i)));
  // System.out.println("Score of the document is: " + hits.score(i));
  String path = doc.get("path");
  if (path != null) {
    System.out.println(i + ". " + path);
    System.out.println("--");
  }
}
---
   
   
   Here is the output from the program:
   
   Query: ibm risc tape drive manual
   
   Searching for: ibm risc tape drive manual
   
   4 total matching documents
   
   Score is: 0.16266039 title:null description:null
   
   0.16266039 = product of:
   
   0.20332548 = sum of:
   
   0.03826245 = weight(contents:ibm in 1), product of:
   
   0.31521872 = queryWeight(contents:ibm), product of:
   
   0.7768564 = idf(docFreq=4)
   
   0.40576187 = queryNorm
   
   0.121383816 = fieldWeight(contents:ibm in 1), product of:
   
   1.0 = tf(termFreq(contents:ibm)=1)
   
   0.7768564 = idf(docFreq=4)
   
   0.15625 = fieldNorm(field=contents, doc=1)
   
   0.06340029 = weight(contents:risc in 1), product of:
   
   0.40576187 = queryWeight(contents:risc), product of:
   
   1.0 = idf(docFreq=3)
   
   0.40576187 = queryNorm
   
   0.15625 = fieldWeight(contents:risc in 1), product of:
   
   1.0 = tf(termFreq(contents:risc)=1)
   
   1.0 = idf(docFreq=3)
   
   0.15625 = fieldNorm(field=contents, doc=1)
   
   0.06340029 = weight(contents:tape in 1), product of:
   
   0.40576187 = queryWeight(contents:tape), product of:
   
   1.0 = idf(docFreq=3)
   
   0.40576187 = queryNorm
   
   0.15625 = fieldWeight(contents:tape in 1), product of:
   
   1.0 = tf(termFreq(contents:tape)=1)
   
   1.0 = idf(docFreq=3)
   
   0.15625 = fieldNorm(field=contents, doc=1)
   
   0.03826245 = weight(contents:drive in 1), product of:
   
   0.31521872 = queryWeight(contents:drive), product of:
   
   0.7768564 = idf(docFreq=4)
   
   0.40576187 = queryNorm
   
   0.121383816 = fieldWeight(contents:drive in 1), product of:
   
   1.0 = tf(termFreq(contents:drive)=1)
   
   0.7768564 = 

Re: Relevance percentage

2004-12-20 Thread Paul Elschot
On Monday 20 December 2004 15:09, Gururaja H wrote:
 Hi,
  
 But, How to calculate the coord() fraction ?  I know by default,
 in DefaultSimilarity the coord() fraction is defined as below:
 
 /** Implemented as <code>overlap / maxOverlap</code>. */
 
 public float coord(int overlap, int maxOverlap) {
 
 return overlap / (float)maxOverlap;
 
 }
 How to get the overlap and maxOverlap value in each of the matched 
document(s) ?

In case you only want the coordination factor to have more influence
in the order of your search results you can use a Similarity with
a coord() function that has a power higher than 1:

  public float coord(int overlap, int maxOverlap) {
return (float) Math.pow((overlap / (float)maxOverlap), SOME_POWER);
  }

I'd first try values between 3.0f and 5.0f for SOME_POWER.

The searching code precomputes all coord values once per query
per search, so there is no need to worry about the computing efficiency.

This has the advantage that the other scoring factors are still used
for ranking.

Since the other factors can vary quite a bit, it is difficult to guarantee
that any coord() implementation will provide a score that sorts by the
number of matching clauses. Higher powers as above can come
a long way, though.
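
A sketch of installing such a coord() on the searcher, with SOME_POWER =
4.0f from the suggested range (the class name and index path are
illustrative):

import org.apache.lucene.search.DefaultSimilarity;
import org.apache.lucene.search.IndexSearcher;

public class BoostedCoordSearch {
  public static IndexSearcher open(String indexDir) throws Exception {
    IndexSearcher searcher = new IndexSearcher(indexDir);
    searcher.setSimilarity(new DefaultSimilarity() {
      public float coord(int overlap, int maxOverlap) {
        // stronger preference for documents matching more clauses
        return (float) Math.pow(overlap / (float) maxOverlap, 4.0f);
      }
    });
    return searcher;
  }
}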

Regards,
Paul Elschot


  
 Thanks,
 Gururaja
 
 Mike Snare [EMAIL PROTECTED] wrote:
 I'm still new to Lucene, but wouldn't that be the coord()? My
 understanding is that the coord() is the fraction of the boolean query
 that matched a given document.
 
 Again, I'm new, so somebody else will have to confirm or deny...
 
 -Mike
 
 
 On Mon, 20 Dec 2004 00:33:21 -0800 (PST), Gururaja H
 wrote:
  How to find out the percentages of matched terms in the document(s) using 
Lucene ?
  Here is an example, of what i am trying to do:
  The search query has 5 terms (ibm, risc, tape, drive, manual) and there are 
4 matching
  documents with the following attributes:
  Doc#1: contains terms(ibm,drive)
  Doc#2: contains terms(ibm,risc, tape, drive)
  Doc#3: contains terms(ibm,risc, tape,drive)
  Doc#4: contains terms(ibm, risc, tape, drive, manual).
  The percentages displayed would be 100%(Doc#4), 80%(doc#2), 80%(doc#3) and 
40%
  (doc#1).
  
  Any help on how to go about doing this ?
  
  Thanks,
  Gururaja
  
  
 
 
 
 
   





Re: analyzer affecting phrases?

2004-12-20 Thread Erik Hatcher
On Dec 20, 2004, at 12:43 PM, Peter Posselt Vestergaard wrote:
Therefore I turned back to the standard analyzer and now do some 
replacing
of the underscores in my ID string to avoid my original problem. This 
solved
my phrase problem so that I can now search for phrases. However I 
still have
the problem with ,.: described above. As far as I can see the
StandardAnalyzer (the StandardTokenizer that is) should tokenize words
without the ,.: characters. Am I mistaken? Is there a tokenizer that 
will
do this?
StandardAnalyzer does tokenize without ",.:", though it will keep 
domain names together.  Here's an example:

$ ant -emacs AnalyzerDemo
Buildfile: build.xml
AnalyzerDemo:
  Demonstrates analysis of sample text.
  Refer to the Analysis chapter for much more on this
  extremely crucial topic.
Press return to continue...
String to analyze: [This string will be analyzed.]
Example with commas, colons, and dots.  You can get this code from 
http://www.lucenebook.com
Running lia.analysis.AnalyzerDemo...
Analyzing "Example with commas, colons, and dots.  You can get this 
code from http://www.lucenebook.com"
  WhitespaceAnalyzer:
[Example] [with] [commas,] [colons,] [and] [dots.] [You] [can] 
[get] [this] [code] [from] [http://www.lucenebook.com]

  SimpleAnalyzer:
[example] [with] [commas] [colons] [and] [dots] [you] [can] [get] 
[this] [code] [from] [http] [www] [lucenebook] [com]

  StopAnalyzer:
[example] [commas] [colons] [dots] [you] [can] [get] [code] [from] 
[http] [www] [lucenebook] [com]

  StandardAnalyzer:
[example] [commas] [colons] [dots] [you] [can] [get] [code] [from] 
[http] [www.lucenebook.com]


BUILD SUCCESSFUL
Total time: 7 seconds


Re: determination of matching hits

2004-12-20 Thread Christiaan Fluit
OK, I feel a bit stupid now ;) It turns out this issue was discussed a 
while ago on both mailing lists, and I even participated in one of 
the threads... shame on me.

The problem is indeed in how MFQP parses my query: the query A -B becomes:
(text:A -text:B) (title:A -title:B) (path:A -path:B) (summary:A 
-summary:B) (agent:A -agent:B)

whereas I intuitively expected it to be evaluated as "A in any field and 
not B in any field". When I use a normal QueryParser and let it use a 
single field only, everything works as expected.
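
For reference, a sketch of building that intended query by hand with the
1.4 BooleanQuery API, where add(query, required, prohibited) marks each
clause (AnyFieldQuery is an illustrative name):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class AnyFieldQuery {
  // "a in at least one field AND b in no field".
  public static Query build(String[] fields, String a, String b) {
    BooleanQuery any = new BooleanQuery();
    for (int i = 0; i < fields.length; i++) {
      any.add(new TermQuery(new Term(fields[i], a)), false, false); // SHOULD
    }
    BooleanQuery top = new BooleanQuery();
    top.add(any, true, false);                                      // MUST
    for (int i = 0; i < fields.length; i++) {
      top.add(new TermQuery(new Term(fields[i], b)), false, true);  // MUST NOT
    }
    return top;
  }
}

e.g. build(new String[] {"text", "title", "path", "summary", "agent"}, "A", "B").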

Browsing the list archives I see that there were some efforts from 
different people to solve this issue, but I'm a bit confused about the 
final outcome. Was this solved in the MFQP in 1.4.3? If not, which 
alternative implementation of MFQP is currently best to use?

Kind regards,
Chris
--
Erik Hatcher wrote:
Christian,
Please simplify your situation.  Use a plain TermQuery for B and see 
what is returned.  Then use a simple BooleanQuery for A -B.  I suspect 
MultiFieldQueryParser is the culprit.  What does the toString of the 
generated Query return?  MFQP is known to be trouble, and an overhaul to 
it has been contributed recently.


RE: determination of matching hits

2004-12-20 Thread Chuck Williams
This is not the official recommendation, but I'd suggest you at least
consider:  http://issues.apache.org/bugzilla/show_bug.cgi?id=32674

If you're not using Java 1.5 and you decide you want to use it, you'd
need to take out those dependencies.  If you improve it, please share.

Chuck

   -Original Message-
   From: Christiaan Fluit [mailto:[EMAIL PROTECTED]
   Sent: Monday, December 20, 2004 2:51 PM
   To: Lucene Users List
   Subject: Re: determination of matching hits
   
   ok, I feel a bit stupid now ;) Turns out this issue has been
discussed a
   while ago on both mailing lists and I even participated in one of
   them... shame on me.
   
   The problem is indeed in how MFQP parses my query: the query A -B
   becomes:
   
   (text:A -text:B) (title:A -title:B) (path:A -path:B) (summary:A
   -summary:B) (agent:A -agent:B)
   
   whereas I intuitively expected it to be evaluated as A in any field
and
   not B in any field. When I use a normal QueryParser and let it use
a
   single field only, everything works as expected.
   
   Browsing the lists archives I see that there were some efforts from
   different people in solving this issue, but I'm a bit confused about
the
   final outcome. Was this solved in the MFQP in 1.4.3? If not, what
   alternative implementation of MFPQ can I currently use best?
   
   
   Kind regards,
   
   Chris
   --
   
   Erik Hatcher wrote:
Christian,
   
Please simplify your situation.  Use a plain TermQuery for B and
see
what is returned.  Then use a simple BooleanQuery for A -B.  I
   suspect
MultiFieldQueryParser is the culprit.  What does the toString of
the
generated Query return?  MFQP is known to be trouble, and an
overhaul
   to
it has been contributed recently.
   
  





index size doubled?

2004-12-20 Thread aurora
I'm testing the rebuilding of the index. I add several hundred documents,  
optimize, add another few hundred, and so on. Right now I have around  
7000 files. I observed that after the index gets to a certain size, every time  
after optimize there are two files of roughly the same size, like below:

12/20/2004  01:57p  13 deletable
12/20/2004  01:57p  29 segments
12/20/2004  01:53p  14,460,367 _5qf.cfs
12/20/2004  01:57p  15,069,013 _5zr.cfs
The total index size is double what I expect. This is not always  
reproducible. (I'm constantly tuning my program and the set of documents.)  
Sometimes I get a single file of the expected size after optimize. What is happening?



RE: Relevance and ranking ...

2004-12-20 Thread Gururaja H
Hi Chuck Williams & Paul Elschot,
 
Thanks so much for the reply.
 
By overriding coord() as follows, I was able to get the right order for the 
example that I gave in this thread.
 
public float coord(int overlap, int maxOverlap) {
return (float) Math.pow((overlap / (float)maxOverlap), SOME_POWER);
  }

Using 2.0f for SOME_POWER.
 
As Chuck Williams suggested, I am trying more example cases.
 
Thanks, Again.
 
Gururaja


Chuck Williams [EMAIL PROTECTED] wrote:
I believe your sole problem is that you need to tone down your
lengthNorm. Because doc4 is 10 times longer than doc2, its lengthNorm
is less than 1/3 of that of doc2 (1/sqrt(10) to be precise). This is a
larger effect than the higher coord factor (1/.8) and the extra matching
term in doc4.

In your original description, it sounds like you want coord() to
dominate lengthNorm(), with lengthNorm() just being used as a
tie-breaker among queries with the same coord().

To achieve this, you need to reduce the impact of the lengthNorm()
differences, by changing the sqrt() function in the computation of
lengthNorm to something much flatter. E.g., you might use:

public float lengthNorm(String fieldName, int numTerms) {
return (float)(1.0 / Math.log10(1000+numTerms));
}

I'm not sure whether that specific formula will work, but you can find
one that will by adjusting the base of the logarithm and the additive
constant (1000 in the example).

Some general things:
1. You need to reindex when you change the Similarity (it is used for
indexing and searching -- e.g., the lengthNorm's are computed at index
time).
2. Be careful not to overtune your scoring for just one example. Try
many examples. You won't be able to get it perfect -- the idea is to
get close to your subjective judgments as frequently as possible.
3. The idea here is to find a value of lengthNorm() that doesn't
override coord, but still provides the tie-breaking you are looking for
(doc2 ahead of doc3).

Chuck

 -Original Message-
 From: Gururaja H [mailto:[EMAIL PROTECTED]
 Sent: Sunday, December 19, 2004 10:10 PM
 To: Lucene Users List
 Subject: RE: Relevance and ranking ...
 
 Chuck Williams,
 
 Thanks for the reply. Source code and Output are below.
 
 Please give me your inputs.
 
 Default document order i am getting is: Doc#2, Doc#4, Doc#3, Doc#1.
 Document order needed is: Doc#4, Doc#2, Doc#3, Doc#1.
 
 Let me know, if you need more information.
 
 NOTE: Using Luene Query object not BooleanQuery.
 
 Here is the source code:
 
 Searcher searcher = new IndexSearcher(index);
 
 Analyzer analyzer = new StandardAnalyzer();
 BufferedReader in = new BufferedReader(new
     InputStreamReader(System.in));
 System.out.print("Query: ");
 String line = in.readLine();
 Query query = QueryParser.parse(line, "contents", analyzer);
 System.out.println("Searching for: " + query.toString("contents"));
 Hits hits = searcher.search(query);
 System.out.println(hits.length() + " total matching documents");
 for (int i = start; i < hits.length(); i++) {
   Document doc = hits.doc(i);
   System.out.print("Score is: " + hits.score(i));
   // Use whatever your fields are here:
   System.out.print("  title:");
   System.out.print(doc.get("title"));
   System.out.print(" description:");
   System.out.println(doc.get("description"));
   // End of fields
   System.out.println(searcher.explain(query, hits.id(i)));
   // System.out.println("Score of the document is: " + hits.score(i));
   String path = doc.get("path");
   if (path != null) {
     System.out.println(i + ". " + path);
     System.out.println("--");
   }
 }
 ---
 
 
 Here is the output from the program:
 
 Query: ibm risc tape drive manual
 
 Searching for: ibm risc tape drive manual
 
 4 total matching documents
 
 Score is: 0.16266039 title:null description:null
 
 0.16266039 = product of:
 
 0.20332548 = sum of:
 
 0.03826245 = weight(contents:ibm in 1), product of:
 
 0.31521872 = queryWeight(contents:ibm), product of:
 
 0.7768564 = idf(docFreq=4)
 
 0.40576187 = queryNorm
 
 0.121383816 = fieldWeight(contents:ibm in 1), product of:
 
 1.0 = tf(termFreq(contents:ibm)=1)
 
 0.7768564 = idf(docFreq=4)
 
 0.15625 = fieldNorm(field=contents, doc=1)
 
 0.06340029 = weight(contents:risc in 1), product of:
 
 0.40576187 = queryWeight(contents:risc), product of:
 
 1.0 = idf(docFreq=3)
 
 0.40576187 = queryNorm
 
 0.15625 = fieldWeight(contents:risc in 1), product of:
 
 1.0 = tf(termFreq(contents:risc)=1)
 
 1.0 = idf(docFreq=3)
 
 0.15625 = fieldNorm(field=contents, doc=1)
 
 0.06340029 = weight(contents:tape in 1), product of:
 
 0.40576187 = queryWeight(contents:tape), product of:
 
 1.0 = idf(docFreq=3)
 
 0.40576187 = queryNorm
 
 0.15625 = fieldWeight(contents:tape in 1), product of:
 
 1.0 = tf(termFreq(contents:tape)=1)
 
 1.0 = idf(docFreq=3)
 
 0.15625 = fieldNorm(field=contents, doc=1)
 
 0.03826245 = weight(contents:drive in 1), product of:
 
 0.31521872 = queryWeight(contents:drive), product of:
 
 0.7768564 = idf(docFreq=4)
 
 0.40576187 = queryNorm
 
 0.121383816 = 

Re: Relevance percentage

2004-12-20 Thread Gururaja H
Thanks much for the reply.

Paul Elschot [EMAIL PROTECTED] wrote:
On Monday 20 December 2004 15:09, 
Gururaja H wrote:
 Hi,
 
 But, How to calculate the coord() fraction ? I know by default,
 in DefaultSimilarity the coord() fraction is defined as below:
 
 /** Implemented as overlap / maxOverlap. */
 
 public float coord(int overlap, int maxOverlap) {
 
 return overlap / (float)maxOverlap;
 
 }
 How to get the overlap and maxOverlap value in each of the matched 
document(s) ?

In case you only want the coordination factor to have more influence
in the order of your search results you can use a Similarity with
a coord() function that has a power higher than 1:

public float coord(int overlap, int maxOverlap) {
return (float) Math.pow((overlap / (float)maxOverlap), SOME_POWER);
}

I'd first try values between 3.0f and 5.0f for SOME_POWER.

The searching code precomputes all coord values once per query
per search, so there is no need to worry about the computing efficiency.

This has the advantage that the other scoring factors are still used
for ranking.

Since the other factors can vary quite a bit, it is difficult to guarantee
that any coord() implementation will provide a score that sorts by the
number of matching clauses. Higher powers as above can come
a long way, though.

Regards,
Paul Elschot



 Thanks,
 Gururaja
 
 Mike Snare wrote:
 I'm still new to Lucene, but wouldn't that be the coord()? My
 understanding is that the coord() is the fraction of the boolean query
 that matched a given document.
 
 Again, I'm new, so somebody else will have to confirm or deny...
 
 -Mike
 
 
 On Mon, 20 Dec 2004 00:33:21 -0800 (PST), Gururaja H
 wrote:
  How to find out the percentages of matched terms in the document(s) using 
Lucene ?
  Here is an example, of what i am trying to do:
  The search query has 5 terms (ibm, risc, tape, drive, manual) and there are 
4 matching
  documents with the following attributes:
  Doc#1: contains terms(ibm,drive)
  Doc#2: contains terms(ibm,risc, tape, drive)
  Doc#3: contains terms(ibm,risc, tape,drive)
  Doc#4: contains terms(ibm, risc, tape, drive, manual).
  The percentages displayed would be 100%(Doc#4), 80%(doc#2), 80%(doc#3) and 
40%
  (doc#1).
  
  Any help on how to go about doing this ?
  
  Thanks,
  Gururaja
  
  
 
 
 
 
 







RE: Relevance percentage

2004-12-20 Thread Gururaja H
Thanks much for the reply.

Chuck Williams [EMAIL PROTECTED] wrote:
The coord() value is not saved 
anywhere so you would need to recompute
it. You could either call explain() and parse the result string, or
better, look at explain() and implement what it does more efficiently
just for coord(). If your queries are all BooleanQuery's of
TermQuery's, then this is very simple. Iterate down the list of
BooleanClause's and count the number whose score is > 0, then divide
this by the total number of clauses. Take a look at
BooleanQuery.BooleanWeight.explain() as it does this (along with
generating the rest of the explanation). If you support the full Lucene
query language, then you need to look at all the query types and decide
what exactly you want to compute (as coord is not always well-defined).

I'm on the West Coast of the U.S. so evidently on a very different time
zone from you -- will look at your other message next.

Chuck

 -Original Message-
 From: Gururaja H [mailto:[EMAIL PROTECTED]
 Sent: Monday, December 20, 2004 6:10 AM
 To: Lucene Users List; Mike Snare
 Subject: Re: Relevance percentage
 
 Hi,
 
 But, How to calculate the coord() fraction ? I know by default,
 in DefaultSimilarity the coord() fraction is defined as below:
 
 /** Implemented as overlap / maxOverlap. */
 
 public float coord(int overlap, int maxOverlap) {
 
 return overlap / (float)maxOverlap;
 
 }
 How to get the overlap and maxOverlap value in each of the matched
 document(s) ?
 
 Thanks,
 Gururaja
 
 Mike Snare wrote:
 I'm still new to Lucene, but wouldn't that be the coord()? My
 understanding is that the coord() is the fraction of the boolean
query
 that matched a given document.
 
 Again, I'm new, so somebody else will have to confirm or deny...
 
 -Mike
 
 
 On Mon, 20 Dec 2004 00:33:21 -0800 (PST), Gururaja H
 wrote:
  How to find out the percentages of matched terms in the
document(s)
 using Lucene ?
  Here is an example, of what i am trying to do:
   The search query has 5 terms (ibm, risc, tape, drive, manual) and
there
 are 4 matching
  documents with the following attributes:
  Doc#1: contains terms(ibm,drive)
  Doc#2: contains terms(ibm,risc, tape, drive)
  Doc#3: contains terms(ibm,risc, tape,drive)
  Doc#4: contains terms(ibm, risc, tape, drive, manual).
  The percentages displayed would be 100%(Doc#4), 80%(doc#2),
80%(doc#3)
 and 40%
  (doc#1).
 
  Any help on how to go about doing this ?
 
  Thanks,
  Gururaja
 
 
 
 

 
 
 



