Re: boosting & StandardAnalyzer, stop words

2003-12-10 Thread Stefan Groschupf

> Perhaps we'd better continue this on lucene-dev.

OK, I will subscribe to that list and ask again there.
Thanks!
Stefan
--
open technology: http://www.media-style.com
open source: http://www.weta-group.net
open discussion: http://www.text-mining.org



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Rebuild index?

2003-12-10 Thread Dror Matalon

What version are you running?

Something removed one of the files that Lucene needs while it was using
the index.

Could there have been something running in the background and cleaning
/tmp?


On Wed, Dec 10, 2003 at 01:43:42PM +0200, Igor Semenko wrote:
> Hello,
> 
> We use Lucene to search menus; there are around 1 items in the
> index, and sometimes I see an error like this:
> (/tmp/index-menu is the index directory)
> java.io.FileNotFoundException: /tmp/index-menu/_6q2.prx (No such file or directory)
> at java.io.RandomAccessFile.open(Native Method)
> at java.io.RandomAccessFile.<init>(RandomAccessFile.java:204)
> at org.apache.lucene.store.FSInputStream$Descriptor.<init>(FSDirectory.java:389)
> at org.apache.lucene.store.FSInputStream.<init>(FSDirectory.java:418)
> at org.apache.lucene.store.FSDirectory.openFile(FSDirectory.java:291)
> at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:132)
> at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:103)
> at org.apache.lucene.index.IndexReader$1.doBody(IndexReader.java:119)
> at org.apache.lucene.store.Lock$With.run(Lock.java:148)
> at org.apache.lucene.index.IndexReader.open(IndexReader.java:110)
> ...
> 
> When I simply rebuild the index, the problem goes away.
> Could someone give me a hint as to what might cause such strange behavior?
> 
> -- 
> Thanks,
> Igor Semenko,
> http://www.webfood.us
> 
> 
> 

-- 
Dror Matalon
Zapatec Inc 
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com




Re: Good and performance and fuzzy search

2003-12-10 Thread julien gerard

Erik Hatcher <[EMAIL PROTECTED]>:

>> You've got quite a tough task ahead of you I think.  You originally
>> said you wanted to limit documents, which is what a Filter does.  But a

OK, so I need some more English lessons.

>> FuzzyQuery still needs to go over all the terms, otherwise how would it
>> know if there was a match or not before even considering the documents.

I don't know; I haven't dug deeply enough into the code to understand the
behaviour and meaning of all the classes, but apparently I need to...

>> It'll be interesting to see what solution you come up with.
>>
>>   Erik

Me too :)

Julien.







Re: Good and performance and fuzzy search

2003-12-10 Thread Erik Hatcher
On Wednesday, December 10, 2003, at 05:27  PM, julien gerard wrote:
> But in this case, is the fuzzy search performed over the whole index?
> Does the QueryFilter do its job afterwards?
> I'm not sure I understand what QueryFilter means.
>
> But I also tested QueryFilter this way, and the search takes the same
> time.
>
> Fuzzy search is time consuming, which is normal, so I'm looking for a
> way to have fewer terms to compare with the fuzzy algorithm.
>
> I'm looking at the FuzzyTermEnum class to see how to redefine it to
> implement a FuzzySubsetTermEnum with the constructor:
> FuzzySubsetTermEnum(IndexReader reader, Term term, Term subset)
>
> For retrieving only the term which also

You've got quite a tough task ahead of you I think.  You originally 
said you wanted to limit documents, which is what a Filter does.  But a 
FuzzyQuery still needs to go over all the terms, otherwise how would it 
know if there was a match or not before even considering the documents.
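
To make the point about going over all the terms concrete, here is a rough
plain-Java sketch (not Lucene code). It assumes the term dictionary is just a
list of strings and uses a hypothetical maxEdits cut-off; Lucene's own
FuzzyQuery works on its term enumeration and scoring, which this does not
reproduce. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FuzzyTermScan {
    // Classic dynamic-programming Levenshtein (edit) distance, two rows at a time.
    static int levenshtein(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[t.length()];
    }

    // Compares the target with every term: the cost grows with the number of
    // distinct terms in the index, regardless of how many documents are wanted.
    static List<String> fuzzyMatches(List<String> allTerms, String target, int maxEdits) {
        List<String> matches = new ArrayList<>();
        for (String term : allTerms) {
            if (levenshtein(term, target) <= maxEdits) {
                matches.add(term);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("lucene", "lucent", "license", "search", "lucerne");
        System.out.println(fuzzyMatches(terms, "lucene", 1));
    }
}
```

Only after this scan has produced the matching terms do the documents come
into play, which is why restricting the documents alone does not shorten it.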

It'll be interesting to see what solution you come up with.

	Erik



RE: Query Parser AND / OR

2003-12-10 Thread Jamie Stallwood
What Morus is saying is right: an expression without parentheses, when
interpreted, treats the terms on either side of an AND as compulsory and
the terms on either side of an OR as optional. However, if you combine AND
and OR in one expression, the optional terms have no effect on which
documents match, because the others are compulsory.

What needs to be done is that the query parser should process any query
string that contains AND and "put brackets" round it first. As it stands it
is of no use, as the OR does not work in the way you would think. AND should
be given implicit priority.
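
The behaviour described above can be modelled in a few lines of plain Java.
This is only a sketch of the rule as stated in this thread, not the actual
QueryParser code: it assumes a query of the form "term OP term OP term"
without parentheses, and the class and method names are invented.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class FlatBooleanModel {
    // The 15 test documents from Morus's program.
    static final String[] DOCS = {
        "a", "b", "c", "d",
        "a b", "a c", "a d", "b c", "b d", "c d",
        "a b c", "a b d", "a c d", "b c d",
        "a b c d"
    };

    // Any term next to an AND becomes required (+); every other term stays optional.
    static String flatten(String query) {
        String[] tok = query.trim().split("\\s+");
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tok.length; i += 2) {   // terms sit at even positions
            boolean required = (i > 0 && tok[i - 1].equals("AND"))
                    || (i + 1 < tok.length && tok[i + 1].equals("AND"));
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(required ? "+" : "").append(tok[i]);
        }
        return out.toString();
    }

    // A document must contain every required term; optional terms only
    // decide the match when the query has no required term at all.
    static boolean matches(String flat, Set<String> doc) {
        boolean anyRequired = false;
        boolean anyOptionalHit = false;
        for (String clause : flat.split(" ")) {
            if (clause.startsWith("+")) {
                anyRequired = true;
                if (!doc.contains(clause.substring(1))) {
                    return false;
                }
            } else if (doc.contains(clause)) {
                anyOptionalHit = true;
            }
        }
        return anyRequired || anyOptionalHit;
    }

    static int countMatches(String query) {
        String flat = flatten(query);
        int hits = 0;
        for (String doc : DOCS) {
            if (matches(flat, new HashSet<>(Arrays.asList(doc.split(" "))))) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        for (String q : new String[] {"a OR b AND c", "a AND b OR c AND d"}) {
            System.out.println(q + " -> " + flatten(q) + "  (" + countMatches(q) + " documents)");
        }
    }
}
```

Under this model "a OR b AND c" flattens to "a +b +c" and "a AND b OR c AND d"
to "+a +b +c +d", which agrees with the toString output and hit counts Morus
posted for the 15 test documents.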


-----Original Message-----
From: Morus Walter [mailto:[EMAIL PROTECTED]
Sent: 10 December 2003 09:01
To: Lucene Users List
Subject: Re: Query Parser AND / OR

Hi Dror,

thanks for your answer.
> >
> > I'm having problems understanding query parsers handling of AND and OR
> > if there's more than one operator.
> >
> > E.g.
> > a OR b AND c
> > gives the same number of hits as
> > b AND c
> > (only scores are different)
>
> This would make sense if all the documents that have a also have both b
> and c in them.
>
Then the query should be equivalent to (a OR b) AND c.
But it isn't. For specific a, b and c I get 766 hits for a OR b AND c
and 1086 for (a OR b) AND c.

> >
> > and
> > a AND b OR c AND d
> > seems to be equivalent to
> > a AND b AND C AND d
> >
>

a OR b AND c -> a +b +c
  4 documents found
a b c
a b c d
b c
b c d
(a OR b) AND c -> +(a b) +c
  6 documents found







Re: Good and performance and fuzzy search

2003-12-10 Thread julien gerard
But in this case, is the fuzzy search performed over the whole index? Does the
QueryFilter do its job afterwards?
I'm not sure I understand what QueryFilter means.

But I also tested QueryFilter this way, and the search takes the same time.

Fuzzy search is time consuming, which is normal, so I'm looking for a way to have
fewer terms to compare with the fuzzy algorithm.

I'm looking at the FuzzyTermEnum class to see how to redefine it to implement a
FuzzySubsetTermEnum with the constructor:
FuzzySubsetTermEnum(IndexReader reader, Term term, Term subset)

For retrieving only the term which also 

Erik Hatcher wrote :

>> QueryFilter would do the trick if you instead used the query you handed
>> to it to be the one to single out a "sub-category".  It would limit the
>> documents searched to just the sub-category, and the fuzzy search would
>> be done using IndexSearcher.search, only handing it the filter then as
>> well.
>>
>> Will this scheme work for you?
>>
>>   Erik







Re: Good and performance and fuzzy search

2003-12-10 Thread Erik Hatcher
On Wednesday, December 10, 2003, at 04:07  PM, julien gerard wrote:
> I'm trying to optimize a fuzzy search on a big index with
> ~4.400.000 Documents (in Lucene's meaning) in 600.000 sub-categories
> (a simple Text.Keyword type field).
>
> My goal is to limit the number of documents on which the fuzzy
> search with Levenshtein distance is performed (a user cannot search
> all 600.000 sub-categories, only 1 to 3 at most).
>
> The classic Lucene ways to do that do not fit my case:
> - multiple indexes: having 600.000 indexes is a nightmare for
> maintenance.
> - QueryFilter does not fit because it is the fuzzy search which is in
> the QueryFilter, and the number of different requests is too large,
> so I cannot reuse the same filter.
> - BooleanQuery with the 'AND' parameter does not fit either, because
> both searches are executed and the results are merged afterwards.


QueryFilter would do the trick if you instead used the query you handed 
to it to be the one to single out a "sub-category".  It would limit the 
documents searched to just the sub-category, and the fuzzy search would 
be done using IndexSearcher.search, only handing it the filter then as 
well.
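
As a rough illustration of that scheme, here is a plain-Java model in which
the filter is a bit set with one bit per document. It does not use the Lucene
classes themselves: the queryHits set simply stands in for whatever the fuzzy
query matched, and all names and sample data are invented.

```java
import java.util.BitSet;

class FilterModel {
    // Bit i is set when document i belongs to the wanted sub-category.
    static BitSet categoryFilter(String[] categories, String wanted) {
        BitSet bits = new BitSet(categories.length);
        for (int i = 0; i < categories.length; i++) {
            if (categories[i].equals(wanted)) {
                bits.set(i);
            }
        }
        return bits;
    }

    // Keeps only the query hits that the filter allows.
    static BitSet applyFilter(BitSet queryHits, BitSet filter) {
        BitSet result = (BitSet) queryHits.clone();
        result.and(filter);
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sub-category of documents 0..3.
        String[] categories = {"pizza", "pasta", "pizza", "salad"};

        // Hypothetical fuzzy-query hits: documents 0, 1 and 3.
        BitSet fuzzyHits = new BitSet();
        fuzzyHits.set(0);
        fuzzyHits.set(1);
        fuzzyHits.set(3);

        BitSet filter = categoryFilter(categories, "pizza");
        System.out.println(applyFilter(fuzzyHits, filter));
    }
}
```

The filter is built from the sub-category alone, so it can be kept and reused
for every fuzzy search inside that sub-category.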

Will this scheme work for you?

	Erik



Good and performance and fuzzy search

2003-12-10 Thread julien gerard
Hi,

I'm trying to optimize a fuzzy search on a big index with ~4.400.000 Documents (in
Lucene's meaning) in 600.000 sub-categories (a simple Text.Keyword type field).

My goal is to limit the number of documents on which the fuzzy search with
Levenshtein distance is performed (a user cannot search all 600.000 sub-categories,
only 1 to 3 at most).

The classic Lucene ways to do that do not fit my case:
- multiple indexes: having 600.000 indexes is a nightmare for maintenance.
- QueryFilter does not fit because it is the fuzzy search which is in the QueryFilter,
and the number of different requests is too large, so I cannot reuse the same filter.
- BooleanQuery with the 'AND' parameter does not fit either, because both searches are
executed and the results are merged afterwards.

So ( Ah!!! ) my first question is:
is there any way to do a fuzzy search on a subset of the index that I have not seen yet?

If no such solution exists, which solution could I implement to perform this
kind of search?
I could implement a FuzzyFilter, but I would need to access each document, which is
time consuming.

I know that solution costs a lot of memory, which has already been discussed on
this list, but in my case it is the only way I can see to decrease the execution
time.

Regards,
Julien.

PS: Sorry for my poor English.






Re: Query expansion

2003-12-10 Thread Ralph
How do you model and store your taxonomies/ontologies? Do you use Java
data structures or RDF?

Cheers,
Ralf

> Hi Everybody,
> 
> I wish to use a hierarchy of concepts provided by an ontology to refine
> or expand my query answers with Lucene.
> May I know if someone has tried it yet?
> 
> Thanks,
> Gayo
> 
> 
> 




Re: Query expansion

2003-12-10 Thread ambiesense
Hi,

Expanding a query basically means generating a new one that reuses the
existing terms plus the ones selected from your ontology/taxonomy. This has
been discussed here before, so you should search the archive for that.
Extracting and using the right part of your ontology is basically a task for
your program logic and depends heavily on your reasoning and choices.
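
A minimal sketch of that idea in plain Java, assuming the taxonomy has already
been loaded into a map from a term to its related terms (how you fill that map
from your ontology is exactly the program logic mentioned above). The class
name, method name and sample terms are invented for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class QueryExpansion {
    // Rebuilds the query string: each term that has related terms in the
    // taxonomy is replaced by an OR group of the term and its related terms.
    static String expand(String query, Map<String, List<String>> taxonomy) {
        StringBuilder out = new StringBuilder();
        for (String term : query.trim().split("\\s+")) {
            List<String> related = taxonomy.getOrDefault(term, Collections.<String>emptyList());
            if (out.length() > 0) {
                out.append(' ');
            }
            if (related.isEmpty()) {
                out.append(term);
            } else {
                out.append('(').append(term);
                for (String r : related) {
                    out.append(" OR ").append(r);
                }
                out.append(')');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical taxonomy entry: narrower or equivalent concepts for "car".
        Map<String, List<String>> taxonomy = new HashMap<>();
        taxonomy.put("car", Arrays.asList("automobile", "vehicle"));

        System.out.println(expand("car rental", taxonomy));
    }
}
```

The expanded string can then be handed to the query parser like any other
query; the explicit parentheses keep each OR group together.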

Cheers,
Ralf 


> Hi Everybody,
> 
> I wish to use a hierarchy of concepts provided by an ontology to refine
> or expand my query answers with Lucene.
> May I know if someone has tried it yet?
> 
> Thanks,
> Gayo
> 
> 
> 




Rebuild index?

2003-12-10 Thread Igor Semenko
Hello,

We use Lucene to search menus; there are around 1 items in the
index, and sometimes I see an error like this:
(/tmp/index-menu is the index directory)
java.io.FileNotFoundException: /tmp/index-menu/_6q2.prx (No such file or directory)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:204)
at org.apache.lucene.store.FSInputStream$Descriptor.<init>(FSDirectory.java:389)
at org.apache.lucene.store.FSInputStream.<init>(FSDirectory.java:418)
at org.apache.lucene.store.FSDirectory.openFile(FSDirectory.java:291)
at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:132)
at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:103)
at org.apache.lucene.index.IndexReader$1.doBody(IndexReader.java:119)
at org.apache.lucene.store.Lock$With.run(Lock.java:148)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:110)
...

When I simply rebuild the index, the problem goes away.
Could someone give me a hint as to what might cause such strange behavior?

-- 
Thanks,
Igor Semenko,
http://www.webfood.us





Query expansion

2003-12-10 Thread Gayo Diallo
Hi Everybody,

I wish to use a hierarchy of concepts provided by an ontology to refine
or expand my query answers with Lucene.
May I know if someone has tried it yet?
Thanks,
Gayo


Re: Query Parser AND / OR

2003-12-10 Thread Morus Walter
Hi Dror,

thanks for your answer.
> > 
> > I'm having problems understanding query parsers handling of AND and OR
> > if there's more than one operator.
> > 
> > E.g.
> > a OR b AND c 
> > gives the same number of hits as
> > b AND c
> > (only scores are different)
> 
> This would make sense if all the documents that have a also have both b
> and c in them.
> 
Then the query should be equivalent to (a OR b) AND c.
But it isn't. For specific a, b and c I get 766 hits for a OR b AND c
and 1086 for (a OR b) AND c.

> > 
> > and 
> > a AND b OR c AND d
> > seems to be equivalent to
> > a AND b AND C AND d
> > 
> 
> That's not what I get. 
> http://www.fastbuzz.com/search/results.jsp?query=dean+AND+kerry+AND+clark+AND+gephardt&days=
> returns 479 items
> but
> http://www.fastbuzz.com/search/results.jsp?query=dean+AND+kerry+OR+clark+AND+gephardt&days=
> returns 564 items which indicates that the OR does make a difference.
> As expected, you end up getting more items with the OR.
> 
Hmm. I was sloppy in not specifying the Lucene version.
My tests were on 1.2.
But I reindexed part of my documents using 1.3rc3 and found the same.
What version does fastbuzz use?

I wrote a small test program that indexes all documents consisting of
zero or one occurrences of a, b, c and d (ignoring order, so without
the empty document, that's just 15 docs) and runs some queries
against it.
The program is below; this is what I get:

a OR b AND c -> a +b +c
  4 documents found
a b c
a b c d
b c
b c d
(a OR b) AND c -> +(a b) +c
  6 documents found
a b c
a b c d
a c
b c
a c d
b c d
a OR (b AND c) -> a (+b +c)
  10 documents found
a b c
a b c d
b c
a
b c d
a b
a c
a d
a b d
a c d
b AND c -> +b +c
  4 documents found
b c
a b c
b c d
a b c d
a AND b OR c AND d -> +a +b +c +d
  1 documents found
a b c d
(a AND b) OR (c AND d) -> (+a +b) (+c +d)
  7 documents found
a b c d
a b
c d
a b c
a b d
a c d
b c d
a AND (b OR c) AND d -> +a +(b c) +d
  3 documents found
a b c d
a b d
a c d
((a AND b) OR c) AND d -> +((+a +b) c) +d
  5 documents found
a b c d
a b d
c d
a c d
b c d
a AND (b OR (c AND d)) -> +a +(b (+c +d))
  5 documents found
a b c d
a c d
a b
a b c
a b d
a AND b AND c AND d -> +a +b +c +d
  1 documents found
a b c d

That is with 1.3rc3, 1.3rc2 or 1.3rc1; with 1.2 I get the same results
in a slightly different order.

So I still get the same for
a OR b AND c  and  b AND c
and
a AND b OR c AND d  and  a AND b AND c AND d
(note that the result of the query's toString method is identical in
both cases)
but different results for any operator grouping I can think of.
So to me the question remains: what do AND and OR mean if they are
combined in one expression?
I can understand all the query results where the AND and OR queries are
explicitly grouped by parentheses, and those results are what I expect.
But the rules for combined AND and OR aren't what I would expect.
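
For comparison, the counts one would expect from ordinary boolean logic can be
checked without Lucene at all. The sketch below is plain Java with invented
names; it encodes each of the 15 documents as a bit mask over a, b, c and d and
counts the matches for each explicit grouping.

```java
import java.util.function.IntPredicate;

class GroupingCheck {
    // True when the document (bit 0 = a, bit 1 = b, bit 2 = c, bit 3 = d)
    // contains the given term.
    static boolean has(int doc, char term) {
        return (doc & (1 << (term - 'a'))) != 0;
    }

    // Counts matches over the 15 non-empty combinations of a, b, c and d.
    static int count(IntPredicate query) {
        int hits = 0;
        for (int doc = 1; doc < 16; doc++) {
            if (query.test(doc)) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // (a OR b) AND c
        System.out.println(count(d -> (has(d, 'a') || has(d, 'b')) && has(d, 'c')));
        // a OR (b AND c)
        System.out.println(count(d -> has(d, 'a') || (has(d, 'b') && has(d, 'c'))));
        // b AND c
        System.out.println(count(d -> has(d, 'b') && has(d, 'c')));
        // (a AND b) OR (c AND d)
        System.out.println(count(d -> (has(d, 'a') && has(d, 'b')) || (has(d, 'c') && has(d, 'd'))));
    }
}
```

These give 6, 10, 4 and 7, the same as the explicitly grouped queries above.
If AND bound more tightly than OR, "a OR b AND c" would read as
"a OR (b AND c)" and give 10 documents rather than the 4 the parser returns.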

greetings
Morus

PS: the test program:
  
import org.apache.lucene.document.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.*;
import org.apache.lucene.store.*;
import org.apache.lucene.search.*;
import org.apache.lucene.queryParser.QueryParser;

class LuceneTest
{
    // All 15 non-empty combinations of the terms a, b, c and d.
    static String[] docs = {
        "a", "b", "c", "d",
        "a b", "a c", "a d", "b c", "b d", "c d",
        "a b c", "a b d", "a c d", "b c d",
        "a b c d"
    };

    // Queries mixing AND and OR, with and without explicit grouping.
    static String[] queries = {
        "a OR b AND c",
        "(a OR b) AND c",
        "a OR (b AND c)",
        "b AND c",
        "a AND b OR c AND d",
        "(a AND b) OR (c AND d)",
        "a AND (b OR c) AND d",
        "((a AND b) OR c) AND d",
        "a AND (b OR (c AND d))",
        "a AND b AND c AND d"
    };

    public static void main(String[] argv) throws Exception {
        Directory dir = new RAMDirectory();
        // Empty stop word list, so that "a" is not dropped by the analyzer.
        String[] stop = {};
        Analyzer analyzer = new StandardAnalyzer(stop);

        // Build the index in memory (true = create a new index).
        IndexWriter writer = new IndexWriter(dir, analyzer, true);

        for (int i = 0; i < docs.length; i++) {
            Document doc = new Document();
            doc.add(Field.Text("text", docs[i]));
            writer.addDocument(doc);
        }
        writer.close();

        // Run each query and print its parsed form.
        Searcher searcher = new IndexSearcher(dir);
        for (int i = 0; i < queries.length; i++) {
            Query query = QueryParser.parse(queries[i], "text", analyzer);
            System.out.println(queries[i] + " -> " + query.toString("text"));
            Hits hits = searcher.search(query);
System.out.println("  " + hits.length() + " d