Re: Aramorph Analyzer

2004-12-20 Thread Pierrick Brihaye
Hi, Sorry, I (the aramorph maintainer ;-) was absent from the office... Daniel Naber wrote: Analyzers that provide ambiguous terms (i.e. a token with more than one term at the same position) don't work in Lucene 1.4. This is the correct answer. I've filed a bug about this:

Re: Optimising A Security Filter

2004-12-20 Thread Paul Elschot
On Sunday 19 December 2004 23:05, Steve Skillcorn wrote: Hello All; I bought the Lucene in Action ebook, which is excellent and I can strongly recommend. One question that has arisen from the book though is custom filters. I have the situation where the text of my docs is in Lucene,

Relevance percentage

2004-12-20 Thread Gururaja H
How do I find out the percentage of matched terms in the document(s) using Lucene? Here is an example of what I am trying to do: the search query has 5 terms (ibm, risc, tape, drive, manual) and there are 4 matching documents with the following attributes: Doc#1: contains terms(ibm,drive)

Number of documents

2004-12-20 Thread Daniel Cortes
I have to show my boss whether Lucene is the best option for creating a search engine for a new portal. I want to know how many documents you have in your index, and how big your DB is. The formats the portal has to support are html, jsp, txt, doc, pdf and ppt. Another question that I

Re: Optimising A Security Filter

2004-12-20 Thread Erik Hatcher
Paul already replied, but I'll add my thoughts below to the thread also... On Dec 19, 2004, at 5:05 PM, Steve Skillcorn wrote: I bought the Lucene in Action ebook, which is excellent and I can strongly recommend. Thank you Does the IndexReader that is passed to the “bits” method of the

Re: Number of documents

2004-12-20 Thread Erik Hatcher
On Dec 20, 2004, at 4:08 AM, Daniel Cortes wrote: I have to show my boss whether Lucene is the best option for creating a search engine for a new portal. I want to know how many documents you have in your index, and how big your DB is. I highly recommend you use Luke to examine the index. It

Re: Relevance percentage

2004-12-20 Thread Mike Snare
I'm still new to Lucene, but wouldn't that be the coord()? My understanding is that the coord() is the fraction of the boolean query that matched a given document. Again, I'm new, so somebody else will have to confirm or deny... -Mike On Mon, 20 Dec 2004 00:33:21 -0800 (PST), Gururaja H

Re: Relevance percentage

2004-12-20 Thread Gururaja H
Hi, But, How to calculate the coord() fraction ? I know by default, in DefaultSimilarity the coord() fraction is defined as below: /** Implemented as <code>overlap / maxOverlap</code>. */ public float coord(int overlap, int maxOverlap) { return overlap / (float)maxOverlap; } How to get the

analyzer effecting phrases?

2004-12-20 Thread Peter Posselt Vestergaard
Hi, I am building an index of texts, each related to a unique id. The unique ids might contain a number of underscores, which will make the StandardAnalyzer shorten them after it sees the second underscore in a row. Furthermore, many of the texts I am indexing are in Italian so the removal of

determination of matching hits

2004-12-20 Thread Christiaan Fluit
Hello all, I have a question regarding the determination of the set of matching documents, in particular (I guess) related to the NOT operator. In my case I have a document containing the terms A and B. When I query for either A or for B, I get this document back, just as expected. Now when I

Queries difference

2004-12-20 Thread Alex Kiselevski
Hello, I want to know is there a difference between queries: +city(+London Amsterdam) +address(1_street 2_street) And +city(+London) +city(Amsterdam) +address(1_street) +address(2_street) Thanks in advance Alex Kiselevsky

Re: Queries difference

2004-12-20 Thread Morus Walter
Alex Kiselevski writes: Hello, I want to know is there a difference between queries: +city(+London Amsterdam) +address(1_street 2_street) And +city(+London) +city(Amsterdam) +address(1_street) +address(2_street) I guess you mean city:(... and so on. The first query searches

Re: analyzer effecting phrases?

2004-12-20 Thread Otis Gospodnetic
When searching for phrases, what's important is the position of each token/word extracted by the Analyzer. WhitespaceAnalyzer/LowerCaseFilter don't do anything with the positional information. There is nothing else in your Analyzer? In any case, the following should help you see what your

RE: Queries difference

2004-12-20 Thread Alex Kiselevski
Thanks Morus. So if I understand right, if the second query is: +city(London) +city(Amsterdam) +address(1_street) +address(2_street) both queries have the same value? -----Original Message----- From: Morus Walter [mailto:[EMAIL PROTECTED] Sent: Monday, December 20, 2004 6:11 PM To: Lucene Users

RE: Queries difference

2004-12-20 Thread Otis Gospodnetic
Alex, I think you want this: +city:London +city:Amsterdam +address:1_street +address:2_street Otis --- Alex Kiselevski [EMAIL PROTECTED] wrote: Thanks Morus. So if I understand right, if the second query is: +city(London) +city(Amsterdam) +address(1_street) +address(2_street) Both

Re: determination of matching hits

2004-12-20 Thread Erik Hatcher
Christian, Please simplify your situation. Use a plain TermQuery for B and see what is returned. Then use a simple BooleanQuery for A -B. I suspect MultiFieldQueryParser is the culprit. What does the toString of the generated Query return? MFQP is known to be trouble, and an overhaul to

sorting on a field that can have null values

2004-12-20 Thread Praveen Peddi
Hi all, I am getting null pointer exception when I am sorting on a field that has null value for some documents. Order by in sql does work on such fields and I think it puts all results with null values at the end of the list. Shouldn't lucene also do the same thing instead of throwing null
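For readers skimming the archive, the ordering the poster expects (non-null values sorted, null values at the end) can be sketched in plain Java. This is only an illustration of the desired behaviour on a list of field values; it does not use or describe Lucene's sorting API, and the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class NullsLastSortSketch {

    /** Returns a sorted copy of the values with all nulls moved to the end. */
    static List<String> sortNullsLast(List<String> values) {
        List<String> sorted = new ArrayList<>(values);
        // nullsLast wraps the natural ordering so that null compares greater
        // than any non-null value instead of causing a NullPointerException.
        sorted.sort(Comparator.nullsLast(Comparator.<String>naturalOrder()));
        return sorted;
    }

    public static void main(String[] args) {
        // Hypothetical field values; null stands for "document has no value".
        List<String> fieldValues = Arrays.asList("pear", null, "apple", null, "fig");
        System.out.println(sortNullsLast(fieldValues));
    }
}
```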

RE: Relevance percentage

2004-12-20 Thread Chuck Williams
The coord() value is not saved anywhere so you would need to recompute it. You could either call explain() and parse the result string, or better, look at explain() and implement what it does more efficiently just for coord(). If your queries are all BooleanQuery's of TermQuery's, then this is
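As a rough illustration of recomputing the fraction for the simple case of a flat list of query terms, here is a minimal plain-Java sketch. It assumes you already have the set of terms present in a document; it does not call Lucene at all, and the class name, helper names and sample terms (modeled on the example earlier in this thread) are only for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CoordSketch {

    /** Same arithmetic as the coord() quoted in this thread: overlap / maxOverlap. */
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    /** Counts how many distinct query terms occur in the document's term set. */
    static int overlap(List<String> queryTerms, Set<String> docTerms) {
        int count = 0;
        for (String term : new HashSet<>(queryTerms)) {
            if (docTerms.contains(term)) {
                count++;
            }
        }
        return count;
    }

    /** Percentage of distinct query terms that the document contains. */
    static float matchedPercent(List<String> queryTerms, Set<String> docTerms) {
        int maxOverlap = new HashSet<>(queryTerms).size();
        return 100f * coord(overlap(queryTerms, docTerms), maxOverlap);
    }

    public static void main(String[] args) {
        // Illustrative data: a five-term query and a document holding two of the terms.
        List<String> query = Arrays.asList("ibm", "risc", "tape", "drive", "manual");
        Set<String> doc = new HashSet<>(Arrays.asList("ibm", "drive"));
        System.out.println(matchedPercent(query, doc) + "% of query terms matched");
    }
}
```

How the per-document term set is obtained is left open here; the sketch only covers the counting step.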

RE: analyzer effecting phrases?

2004-12-20 Thread Peter Posselt Vestergaard
Hi again Thanks for your answer, Otis. My analyzer did not do anything else than using the WhitespaceAnalyzer/LowerCaseFilter. However I found out that I got problems with characters such as ,.: when searching because of my simple analyzer. (E.g. I would not be able to search for world in the

RE: Relevance and ranking ...

2004-12-20 Thread Chuck Williams
I believe your sole problem is that you need to tone down your lengthNorm. Because doc4 is 10 times longer than doc2, its lengthNorm is less than 1/3 of that of doc2 (1/sqrt(10) to be precise). This is a larger effect than the higher coord factor (1/.8) and the extra matching term in doc4. In
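The arithmetic in this post can be checked with a few lines of plain Java. The sketch below assumes the default-style length norm of 1/sqrt(number of terms); the absolute document lengths are hypothetical (only the factor of 10 comes from the post), and the extra matching term in doc4 is not modeled.

```java
public class LengthNormSketch {

    /** Length norm of the assumed default form: 1 / sqrt(numTerms). */
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    /** Ratio of the longer document's norm to the shorter document's norm. */
    static double normRatio(int longDocTerms, int shortDocTerms) {
        return lengthNorm(longDocTerms) / lengthNorm(shortDocTerms);
    }

    public static void main(String[] args) {
        int shortDocTerms = 100;   // hypothetical length of the shorter document
        int longDocTerms = 1000;   // ten times longer, as described in the post
        double norm = normRatio(longDocTerms, shortDocTerms); // equals 1/sqrt(10)
        double coordRatio = 1.0 / 0.8;                        // coord advantage quoted in the post
        System.out.printf("length norm ratio: %.3f%n", norm);
        System.out.printf("coord ratio:       %.3f%n", coordRatio);
        // A combined factor below 1 means the length penalty outweighs the coord gain.
        System.out.printf("combined:          %.3f%n", norm * coordRatio);
    }
}
```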

Re: Relevance percentage

2004-12-20 Thread Paul Elschot
On Monday 20 December 2004 15:09, Gururaja H wrote: Hi, But, How to calculate the coord() fraction ? I know by default, in DefaultSimilarity the coord() fraction is defined as below: /** Implemented as <code>overlap / maxOverlap</code>. */ public float coord(int overlap, int maxOverlap) {

Re: analyzer effecting phrases?

2004-12-20 Thread Erik Hatcher
On Dec 20, 2004, at 12:43 PM, Peter Posselt Vestergaard wrote: Therefore I turned back to the standard analyzer and now do some replacing of the underscores in my ID string to avoid my original problem. This solved my phrase problem so that I can now search for phrases. However I still have the

Re: determination of matching hits

2004-12-20 Thread Christiaan Fluit
ok, I feel a bit stupid now ;) Turns out this issue has been discussed a while ago on both mailing lists and I even participated in one of them... shame on me. The problem is indeed in how MFQP parses my query: the query A -B becomes: (text:A -text:B) (title:A -title:B) (path:A -path:B)
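To see why the per-field expansion returns the document, here is a small plain-Java simulation. It models each field as a set of terms and compares the expanded form quoted above with what the query was presumably meant to express (A somewhere, B nowhere). It does not use Lucene; the class, the method names and the sample document are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class PerFieldNotSketch {

    /** Expanded form: some single field contains the required term and lacks the prohibited one. */
    static boolean expandedMatches(Map<String, Set<String>> doc, String required, String prohibited) {
        for (Set<String> terms : doc.values()) {
            if (terms.contains(required) && !terms.contains(prohibited)) {
                return true;
            }
        }
        return false;
    }

    /** Presumed intent: the required term appears in some field and the prohibited term in none. */
    static boolean intendedMatches(Map<String, Set<String>> doc, String required, String prohibited) {
        boolean hasRequired = false;
        for (Set<String> terms : doc.values()) {
            if (terms.contains(prohibited)) {
                return false;
            }
            hasRequired |= terms.contains(required);
        }
        return hasRequired;
    }

    public static void main(String[] args) {
        // Hypothetical document: A occurs in "text", B occurs only in "title".
        Map<String, Set<String>> doc = new HashMap<>();
        doc.put("text", new HashSet<>(Arrays.asList("A")));
        doc.put("title", new HashSet<>(Arrays.asList("B")));
        doc.put("path", new HashSet<>());
        System.out.println("expanded form matches: " + expandedMatches(doc, "A", "B"));
        System.out.println("intended form matches: " + intendedMatches(doc, "A", "B"));
    }
}
```

In this sample the "text" clause is satisfied on its own because B lives in a different field, so the expanded form matches a document that the intended form rejects.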

RE: determination of matching hits

2004-12-20 Thread Chuck Williams
This is not the official recommendation, but I'd suggest you at least consider: http://issues.apache.org/bugzilla/show_bug.cgi?id=32674 If you're not using Java 1.5 and you decide you want to use it, you'd need to take out those dependencies. If you improve it, please share. Chuck

index size doubled?

2004-12-20 Thread aurora
I'm testing the rebuilding of the index. I add several hundred documents, optimize, add another few hundred, and so on. Right now I have around 7000 files. I observed that after the index gets to a certain size, every time after optimize there are two files of roughly the same size, like below:

RE: Relevance and ranking ...

2004-12-20 Thread Gururaja H
Hi Chuck Williams, Paul Elschot, Thanks so much for the reply. By overriding coord() as follows, I was able to get the right order for the example that I gave in this thread. public float coord(int overlap, int maxOverlap) { return (float) Math.pow((overlap / (float)maxOverlap),

Re: Relevance percentage

2004-12-20 Thread Gururaja H
Thanks much for the reply. Paul Elschot [EMAIL PROTECTED] wrote: On Monday 20 December 2004 15:09, Gururaja H wrote: Hi, But, How to calculate the coord() fraction ? I know by default, in DefaultSimilarity the coord() fraction is defined as below: /** Implemented as overlap / maxOverlap.

RE: Relevance percentage

2004-12-20 Thread Gururaja H
Thanks much for the reply. Chuck Williams [EMAIL PROTECTED] wrote: The coord() value is not saved anywhere so you would need to recompute it. You could either call explain() and parse the result string, or better, look at explain() and implement what it does more efficiently just for coord(). If