Re: Dates and others

2003-12-01 Thread Tatu Saloranta
On Monday 01 December 2003 15:13, Dion Almaer wrote:
...
> Interesting.  I implemented an approach which boosted based on the number
> of months in the past, and after tweaking the boost amounts, it seems to do
> the job. I do a fresh reindex every night (since the indexing process takes
> no time at all... unlike our old search solution!)

This sounds interesting, as I have been thinking about the best way to
boost newer documents. Can you share some of your experience regarding
boost values that seemed to make sense? In my case, the CMS I'm working on
stores support documentation for software/hardware, meaning that content is
highly time-sensitive (i.e. documents "decay" pretty quickly).

Since the system already does both incremental reindexing and nightly full
reindexing (the latter to make sure that even if some changed content was
temporarily not [fully] reindexed, it eventually gets indexed properly), I
think I can add boosting fairly easily.

On a related note, it would also be nice if there was a way to start
categorizing general "hot topics" for Lucene developers; it seems like there
are about half a dozen areas where there's lots of interest in improvements
(most of them related to ranking). If so, perhaps there could be more
specific discussion groups, and perhaps also web pages summarizing some of
the discussions and any consensus achieved, even if there's no code to show
for it?

-+ Tatu +-

>
> I read content for the index from different sources. Sometimes the source
> gives me documents loosely in date order, but not all of them. So, it seems
> that one of the other approaches should be taken (adding a month/week field
> etc).  I should look more into the HitCollector and see how it can help me.
>
> The other issue I have is that I would like to prioritize the title field. 
> At the moment I am lazy and add the title to the body (contents = title +
> body) which seems to be OK... however sometimes something that mentions the
> search term in the title should appear higher up in the pecking order.
>
> I am using the QueryParser (subclassed to disallow wildcards etc) to do the
> dirty work for me. Should I get away from this and manage the queries
> myself (and run a Multi against the title field as well as the contents?
>
> Thanks for the great feedback,
>
> Dion
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]





Re: Help with Searching indexes from a web app (Lucene 1.3 rc2)

2003-12-01 Thread Erik Hatcher
Also, reindex with the new API as well.  There are likely  
incompatibilities in the index format.

On Monday, December 1, 2003, at 11:21  AM, Iain Young wrote:

Note, that I've just tried the example webapp supplied with Lucene, and I
appear to be having exactly the same problem with that. The 1.2 version
works ok, but the 1.3 version is displaying a path not found error.

Are there any known incompatibilities with certain versions of Tomcat (I'm
currently using version 4.0.3)

Thanks,
Iain
-----Original Message-----
From: Iain Young [mailto:[EMAIL PROTECTED]
Sent: 01 December 2003 15:40
To: '[EMAIL PROTECTED]'
Subject: Help with Searching indexes from a web app (Lucene 1.3 rc2)

Hi folks.

I'm new to Lucene so this may be an obvious question, but I am having
problems with Lucene 1.3-rc2. I've got a bit of code which looks something
like this

public static void getSearchResults(String searchString, String indexDir)
{
    try
    {
        Searcher searcher = new IndexSearcher(indexDir);
        .
        etc...
        .
    }
    catch (Exception ex)
    {
    }
}

I'm calling it from a web application (servlet) running in tomcat in
conjunction with struts and velocity. If I use the Lucene 1.2 binary
release, it all works fine and I get the search results ok. However, when
I replace the 1.2 jar file with the 1.3-rc2 jar file (leaving all of my
code exactly the same) it stops working, and I get a path not found
exception being thrown.

I've narrowed it down to the IndexReader.open(final Directory directory)
method. Even if I pass a valid Directory object into this (created by
FSDirectory), it just seems to throw the exception, (even though I know
the directory object is not null etc). The bizarre thing is that this
problem only seems to occur when I run it from the web application. If I
invoke the same code from the command line, it works ok, (even though I'm
using the same string for the index dir).

Anyone got any ideas? (I want to use 1.3 because I want to exploit some of
the newer features). Does running from within a web application do
something strange with the paths, even though the strings I'm using are
fully qualified?

Thanks for your help,

Iain Young
http://www.microfocus.com

This e-mail has been scanned for viruses by MCI's Internet Managed Scanning
Services - powered by MessageLabs. For further information visit
http://www.mci.com


Re: disable locks on read only indexes (performance improvement?)

2003-12-01 Thread Dror Matalon
So, the lock is set, the segments file is opened, all the files in the
segments file are opened and then the lock is released? Is that correct?
And we're relying on the OS to keep the file handles around even if the
files are deleted under us? If so, I'm impressed that this is portable.
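
A minimal sketch of the behaviour being asked about, in plain Java and not
taken from Lucene itself: open a file, try to delete it, and keep reading
through the handle that is already open. The class and method names are
made up for illustration, and whether the delete succeeds while the file is
open depends on the platform, so the sketch tolerates a failed delete.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class OpenHandleDemo {
    /** Writes a temp file, opens it, tries to delete it, then reads via the open stream. */
    public static String readAfterDelete(String content) {
        try {
            Path file = Files.createTempFile("segment", ".tmp");
            Files.write(file, content.getBytes(StandardCharsets.UTF_8));
            try (InputStream in = Files.newInputStream(file)) {
                try {
                    Files.delete(file); // the name goes away; the open handle stays usable
                } catch (IOException e) {
                    // Assumption: some platforms may refuse to delete a file that is open.
                }
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(file); // clean up if the earlier delete was refused
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readAfterDelete("segment-data"));
    }
}
```

This only demonstrates the operating-system side of the question; it says
nothing about which files Lucene actually holds open after the lock is
released.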

Dror

On Mon, Dec 01, 2003 at 02:10:55PM -0800, Doug Cutting wrote:
> Kevin A. Burton wrote:
> >Would there be any performance improvement in query throughput and 
> >latency if locking were disabled for readonly indexes?
> 
> The locks are only consulted when opening a new IndexReader.  I doubt 
> very much that you're doing this often enough for this to be significant.
> 
> Doug

-- 
Dror Matalon
Zapatec Inc 
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com




Re: disable locks on read only indexes (performance improvement?)

2003-12-01 Thread Dror Matalon
Let us know what you find out. I would guess that the gains are not
going to be that spectacular. Creating and deleting files should be
"cheap" operations in a modern OS, especially when you compare them to the
costs of opening a new index and populating various caches.
But let us know what you find.

Regards,

Dror

On Mon, Dec 01, 2003 at 02:40:36PM -0800, Kevin A. Burton wrote:
> Dror Matalon wrote:
> 
> >>I would assume that removing this lock could increase performance 
> >>especially to allow multiple concurrent searches on the same data.
> >>   
> >>
> >
> >There was talk about providing that in an upcoming version. Until then
> >you can try RODirectory:
> > http://www.csita.unige.it/software/free/lucene/
> > 
> >
> Looking at the source of FSDirectory it seems easy enough to add a 
> property to disable locks. 
> 
> This way if you wanted to search a directory and you knew it was 100% 
> necessary to search it readonly you could go ahead and do it without 
> having to write lock files. 
> 
> I will write some unit tests to see what the performance is here.  It 
> should be trivial to create an index of a few hundred megs and then run 
> about 200k queries across it with 30 threads or so... then disable locks 
> and see what the total time spent was...
> 
> If it's substantial I think it makes sense to make this contribution :)
> 
> Kevin
> 
> -- 
>NewsMonster - http://www.newsmonster.org/
> 
> Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
>   AIM - sfburtonator,  Web - http://www.peerfear.org/
> GPG fingerprint: 4D20 40A0 C734 307E C7B4  DCAA 0303 3AC5 BD9D 7C4D
>  IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster
> 

-- 
Dror Matalon
Zapatec Inc 
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com




Re: disable locks on read only indexes (performance improvement?)

2003-12-01 Thread Kevin A. Burton
Dror Matalon wrote:

>> I would assume that removing this lock could increase performance
>> especially to allow multiple concurrent searches on the same data.
>
> There was talk about providing that in an upcoming version. Until then
> you can try RODirectory:
>     http://www.csita.unige.it/software/free/lucene/

Looking at the source of FSDirectory it seems easy enough to add a
property to disable locks.

This way if you wanted to search a directory and you knew it was 100%
necessary to search it readonly you could go ahead and do it without
having to write lock files.

I will write some unit tests to see what the performance is here.  It 
should be trivial to create an index of a few hundred megs and then run 
about 200k queries across it with 30 threads or so... then disable locks 
and see what the total time spent was...
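
A rough sketch of such a timing harness, assuming the query is wrapped in a
Runnable so that the same code can be run once with locking and once
without. QueryBenchmark and its Result type are hypothetical names, and the
stand-in workload in main() is only a placeholder for a real search call.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

class QueryBenchmark {
    /** Outcome of one run: how many queries completed and how long the run took. */
    static final class Result {
        public final long completed;
        public final long elapsedNanos;

        Result(long completed, long elapsedNanos) {
            this.completed = completed;
            this.elapsedNanos = elapsedNanos;
        }
    }

    /** Runs the given query totalQueries times on a fixed pool of worker threads. */
    public static Result run(Runnable query, int totalQueries, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong completed = new AtomicLong();
        long start = System.nanoTime();
        for (int i = 0; i < totalQueries; i++) {
            pool.execute(() -> {
                query.run();
                completed.incrementAndGet();
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.HOURS)) {
                throw new IllegalStateException("benchmark did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("benchmark interrupted", e);
        }
        return new Result(completed.get(), System.nanoTime() - start);
    }

    public static void main(String[] args) {
        // Placeholder workload; a real test would execute a search here.
        Result r = run(() -> Math.sqrt(42.0), 200_000, 30);
        System.out.printf("%d queries in %d ms%n", r.completed, r.elapsedNanos / 1_000_000);
    }
}
```

Running it a few times per configuration and discarding the first run would
help keep JIT warm-up from dominating the comparison.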

If it's substantial I think it makes sense to make this contribution :)

Kevin

--
   NewsMonster - http://www.newsmonster.org/
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM - sfburtonator,  Web - http://www.peerfear.org/
GPG fingerprint: 4D20 40A0 C734 307E C7B4  DCAA 0303 3AC5 BD9D 7C4D
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster


Re: disable locks on read only indexes (performance improvement?)

2003-12-01 Thread Kevin A. Burton
Dror Matalon wrote:

> On Mon, Dec 01, 2003 at 01:38:48PM -0800, Kevin A. Burton wrote:
>
>> Would there be any performance improvement in query throughput and
>> latency if locking were disabled for readonly indexes?
>>
>> It doesn't seem like it makes sense to worry about locking if you know
>> for SURE that the index will NEVER be updated again.
>>
>> I'm noticing this problem now.  We are running a live indexer which does
>> a commit every N documents (right now 100,000) and then swaps the new
>> index into the system live.  This index is never again updated and we
>> use a multisearcher.  We then do index merges after a while into new
>> indexes to keep performance high and reduce the number of indexes.
>
> Sounds quite familiar. One question though, when you say "swaps" the new
> index, what do you mean? It's one area where locking might matter. If
> you just use a multisearcher and add the new index I'm guessing that it
> should work fine.

It's a safe operation. It does a directory rename and the new index is then
added to the multi-searcher. It's synchronized, and 100% safe ;)
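
The swap described above could look roughly like the following sketch. It
is not the poster's code: IndexSwapper, swapIn() and snapshot() are
hypothetical names, the list of paths stands in for whatever the
multi-searcher is built from, and an atomic move is assumed to be available
because both directories live on the same filesystem.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

class IndexSwapper {
    private final List<Path> liveIndexes = new ArrayList<>();

    /** Renames a finished index directory into place, then publishes it to searchers. */
    public synchronized void swapIn(Path finished, Path live) {
        try {
            Files.move(finished, live, StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        liveIndexes.add(live); // only visible to searchers after the rename succeeded
    }

    /** Copy of the directories a multi-index searcher should currently cover. */
    public synchronized List<Path> snapshot() {
        return new ArrayList<>(liveIndexes);
    }

    /** Small self-check using temporary directories. */
    public static boolean demo() {
        try {
            Path root = Files.createTempDirectory("indexes");
            Path finished = Files.createDirectory(root.resolve("index-building"));
            Path live = root.resolve("index-0001");
            IndexSwapper swapper = new IndexSwapper();
            swapper.swapIn(finished, live);
            return Files.isDirectory(live)
                    && !Files.exists(finished)
                    && swapper.snapshot().size() == 1;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because both methods are synchronized on the same object, a searcher that
takes a snapshot sees either the old set of indexes or the new one, never a
directory that is still being renamed.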

>> I would assume that removing this lock could increase performance
>> especially to allow multiple concurrent searches on the same data.
>
> There was talk about providing that in an upcoming version. Until then
> you can try RODirectory:
>     http://www.csita.unige.it/software/free/lucene/

Cool.. thanks.

--
   NewsMonster - http://www.newsmonster.org/
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM - sfburtonator,  Web - http://www.peerfear.org/
GPG fingerprint: 4D20 40A0 C734 307E C7B4  DCAA 0303 3AC5 BD9D 7C4D
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster


Re: Dates and others

2003-12-01 Thread Doug Cutting
Dion Almaer wrote:
> Interesting.  I implemented an approach which boosted based on the number
> of months in the past, and after tweaking the boost amounts, it seems to do
> the job. I do a fresh reindex every night (since the indexing process takes
> no time at all... unlike our old search solution!)

If you're reindexing every night, then document boosting should work
well.  The other approaches I mentioned would only be required if you
can't afford to re-index frequently.

> I read content for the index from different sources. Sometimes the source
> gives me documents loosely in date order, but not all of them. So, it seems
> that one of the other approaches should be taken (adding a month/week field
> etc).  I should look more into the HitCollector and see how it can help me.

I wouldn't bother to pursue these approaches.  Document boosting should
work well for you, since you're reindexing.

> The other issue I have is that I would like to prioritize the title field.
> At the moment I am lazy and add the title to the body (contents = title +
> body) which seems to be OK... however sometimes something that mentions the
> search term in the title should appear higher up in the pecking order.
>
> I am using the QueryParser (subclassed to disallow wildcards etc) to do the
> dirty work for me. Should I get away from this and manage the queries
> myself (and run a Multi against the title field as well as the contents?

A separate title field will solve this for you.  You can, as you
suggest, boost title clauses at query time.  Alternately, you could
boost title fields at index time, although that's less flexible.
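
As an illustration of boosting title clauses at query time, here is a small
sketch that rewrites plain user input into a query string using the
field:term^boost form of the query parser syntax. The helper class is
hypothetical, the field names "title" and "contents" follow the ones used
in this thread, and the boost value is a placeholder to tune. It does no
escaping of special query characters, which real input would need.

```java
class TitleBoostQuery {
    /** Expands each whitespace-separated term into a boosted title clause and a plain contents clause. */
    public static String expand(String userInput, float titleBoost) {
        StringBuilder sb = new StringBuilder();
        for (String term : userInput.trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue; // blank input yields an empty query string
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append("title:").append(term).append('^').append(titleBoost)
              .append(" contents:").append(term);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("lucene boost", 4.0f));
    }
}
```

The resulting string would then be handed to the query parser as usual, so
the existing subclass that disallows wildcards can stay in place.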

Note that if you put titles in a separate field and search both the
title and the body field then title matches will tend to be naturally
boosted, since titles tend to be shorter than bodies and hence are
penalized less by length normalization.
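
To make the length effect concrete, here is a tiny sketch using the
1/sqrt(numTerms) length normalization of Lucene's default similarity. The
class name and the term counts are made up for illustration, and the sketch
ignores everything else that goes into a score.

```java
class LengthNormDemo {
    /** Length normalization factor: shorter fields get a larger factor. */
    public static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    public static void main(String[] args) {
        int titleTerms = 4;   // hypothetical title length
        int bodyTerms = 400;  // hypothetical body length
        System.out.println("title factor: " + lengthNorm(titleTerms));
        System.out.println("body factor:  " + lengthNorm(bodyTerms));
    }
}
```

With these made-up lengths a match in the title is weighted ten times as
heavily as the same match in the body, before any explicit boost is applied.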

Doug



RE: Dates and others

2003-12-01 Thread Dion Almaer
 

> -----Original Message-----
> From: Doug Cutting [mailto:[EMAIL PROTECTED] 
> Sent: Monday, December 01, 2003 1:11 PM
> To: Lucene Users List
> Subject: Re: Dates and others
> 
> Dion Almaer wrote:
> > The only real item that I still want to tweak more is getting recent
> > results higher in the list.
> > 
> > I was wondering if something like this could work (or if there is a 
> > better solution)
> > 
> > At index time, I have the date of the content.  I could do some math 
> > where the higher the date (based on the time_t version or whatever) 
> > the more of a setBoost(metric). Or, for every month in the past,
> > create a larger negative number to setBoost()... or something like that.
> > 
> > Would something like this make sense?
> 
> The problem with this approach is that eventually you'll 
> exhaust the range of the boost.  So this will only work if 
> you re-index things from scratch periodically, with a boost 
> of something like 1/days-ago.
> 
> If you're adding documents to the index in date order, then 
> you could use a HitCollector which adjusts scores according 
> to the document number, since document numbers increase as 
> you add to the index.
> 
> If you're not adding things in date order, then you can, when 
> you open the index, build an array mapping document numbers 
> to integer dates. 
> Then your hit collector can use this to either boost or sort 
> hits by date.
> 
> Or you could add a "month" or "week" field to documents, then 
> add it as a clause to your queries with a boost.  Then 
> documents matching the most recent week(s) and/or month(s) 
> would get the boost.
> 
> Doug

Interesting.  I implemented an approach which boosted based on the number
of months in the past, and after tweaking the boost amounts, it seems to do
the job. I do a fresh reindex every night (since the indexing process takes
no time at all... unlike our old search solution!)
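
A sketch of what a month-based boost might look like. The actual boost
amounts used here were not posted, so the table below is purely
hypothetical, as are the class and method names; the returned value is what
would be passed to setBoost() at index time.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

class MonthlyBoost {
    // Hypothetical table: index = whole months of age. Values are placeholders to tune.
    private static final float[] BOOST_BY_MONTH = {2.0f, 1.6f, 1.3f, 1.1f};
    private static final float FLOOR = 1.0f;

    /** Returns the boost for a document published on the given date, as of "today". */
    public static float boostFor(LocalDate published, LocalDate today) {
        long months = ChronoUnit.MONTHS.between(published, today);
        if (months < 0) {
            months = 0; // treat future-dated content as brand new
        }
        return months < BOOST_BY_MONTH.length ? BOOST_BY_MONTH[(int) months] : FLOOR;
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.of(2003, 12, 1);
        System.out.println(boostFor(LocalDate.of(2003, 11, 15), today));
        System.out.println(boostFor(LocalDate.of(2002, 12, 1), today));
    }
}
```

Since the boost depends on the indexing date, this only stays correct with
a full reindex such as the nightly one described above.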

I read content for the index from different sources. Sometimes the source
gives me documents loosely in date order, but not all of them. So, it seems
that one of the other approaches should be taken (adding a month/week field
etc).  I should look more into the HitCollector and see how it can help me.

The other issue I have is that I would like to prioritize the title field.
At the moment I am lazy and add the title to the body (contents = title +
body) which seems to be OK... however sometimes something that mentions the
search term in the title should appear higher up in the pecking order.

I am using the QueryParser (subclassed to disallow wildcards etc) to do the
dirty work for me. Should I get away from this and manage the queries
myself (and run a Multi against the title field as well as the contents?

Thanks for the great feedback,

Dion





Re: disable locks on read only indexes (performance improvement?)

2003-12-01 Thread Doug Cutting
Kevin A. Burton wrote:
> Would there be any performance improvement in query throughput and
> latency if locking were disabled for readonly indexes?

The locks are only consulted when opening a new IndexReader.  I doubt
very much that you're doing this often enough for this to be significant.

Doug



Re: disable locks on read only indexes (performance improvement?)

2003-12-01 Thread Dror Matalon
On Mon, Dec 01, 2003 at 01:38:48PM -0800, Kevin A. Burton wrote:
> Would there be any performance improvement in query throughput and 
> latency if locking were disabled for readonly indexes?
> 
> It doesn't seem like it makes sense to worry about locking if you know 
> for SURE that the index will NEVER be updated again. 
> 
> I'm noticing this problem now.  We are running a live indexer which does 
> a commit every N documents (right now 100,000) and then swaps the new 
> index into the system live.  This index is never again updated and we 
> use a multisearcher.  We then do index merges after a while into new 
> indexes to keep performance high and reduce the number of indexes.

Sounds quite familiar. One question though, when you say "swaps" the new
index, what do you mean? It's one area where locking might matter. If
you just use a multisearcher and add the new index I'm guessing that it
should work fine.

> 
> I would assume that removing this lock could increase performance 
> especially to allow multiple concurrent searches on the same data.

There was talk about providing that in an upcoming version. Until then
you can try RODirectory:
http://www.csita.unige.it/software/free/lucene/


Regards,

Dror

> 
> Kevin
> 
> -- 
>NewsMonster - http://www.newsmonster.org/
> 
> Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
>   AIM - sfburtonator,  Web - http://www.peerfear.org/
> GPG fingerprint: 4D20 40A0 C734 307E C7B4  DCAA 0303 3AC5 BD9D 7C4D
>  IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster
> 

-- 
Dror Matalon
Zapatec Inc 
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com




disable locks on read only indexes (performance improvement?)

2003-12-01 Thread Kevin A. Burton
Would there be any performance improvement in query throughput and 
latency if locking were disabled for readonly indexes?

It doesn't seem like it makes sense to worry about locking if you know 
for SURE that the index will NEVER be updated again. 

I'm noticing this problem now.  We are running a live indexer which does 
a commit every N documents (right now 100,000) and then swaps the new 
index into the system live.  This index is never again updated and we 
use a multisearcher.  We then do index merges after a while into new 
indexes to keep performance high and reduce the number of indexes.

I would assume that removing this lock could increase performance 
especially to allow multiple concurrent searches on the same data.

Kevin

--
   NewsMonster - http://www.newsmonster.org/
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM - sfburtonator,  Web - http://www.peerfear.org/
GPG fingerprint: 4D20 40A0 C734 307E C7B4  DCAA 0303 3AC5 BD9D 7C4D
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster




Re: AW: AW: Real Boolean Model in Lucene?

2003-12-01 Thread Doug Cutting
Karsten Konrad wrote:
> Now hell would be the place for me where I would have to prove that
> Lucene's ranking is exactly equivalent to some transformation of vector
> space and then using the *cosine* for the ranking. Can't be really, as
> Lucene sometimes returns results > 1.0 and only some ruthless
> normalisation keeps it within 0.0 to 1.0. In other words, there still are
> some rough corners in Lucene where a good theorist could find some work.

One problem with computing the theoretically correct normalization is
that it changes each time the index is modified, and must be recomputed
for every document.  This is because it includes IDFs, and, when the
index changes, IDFs change.  That's when you normalize tf/idf vectors to
the unit sphere, i.e., norm=sqrt(sum_t(weight(t)^2)).
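
A small sketch of why that norm is unstable. It computes
norm=sqrt(sum_t(weight(t)^2)) for one document's tf*idf weights and shows
that the value moves when the collection grows, even though the document
itself is unchanged. The class is hypothetical and the IDF used is the
common textbook log(N/df), an assumption for illustration rather than
Lucene's own formula.

```java
class UnitNorm {
    /** Euclidean length of a weight vector: sqrt(sum_t weight(t)^2). */
    public static double norm(double[] weights) {
        double sum = 0.0;
        for (double w : weights) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }

    /** Textbook IDF, log(N / df); an assumption, not Lucene's exact formula. */
    public static double idf(int numDocs, int docFreq) {
        return Math.log((double) numDocs / docFreq);
    }

    /** Norm of one document's tf*idf vector for a given collection size. */
    public static double docNorm(int[] tf, int[] df, int numDocs) {
        double[] w = new double[tf.length];
        for (int t = 0; t < tf.length; t++) {
            w[t] = tf[t] * idf(numDocs, df[t]);
        }
        return norm(w);
    }

    public static void main(String[] args) {
        int[] tf = {3, 1}; // term frequencies within the document
        int[] df = {10, 50}; // document frequencies of the same terms
        System.out.println(docNorm(tf, df, 100)); // before adding documents
        System.out.println(docNorm(tf, df, 200)); // after the index has grown
    }
}
```

The two printed values differ, which is the reason the exact normalization
would have to be recomputed for every document after each index change.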

But research has also shown that this mathematically correct
normalization does not give the best retrieval results, e.g. Singhal et
al.'s work on pivoted normalization:

  http://citeseer.ist.psu.edu/singhal96pivoted.html

More generally, this shows that information retrieval is not a
theoretical field, but rather a heuristic one.  (Although someone may
flame me for saying that...)  The vector space model is just a useful
analogy, not a verifiable theory of document meaning.  It suggests some
formulae which can be improved through inspiration and experimentation.

Doug



Re: How to change similarity measure...

2003-12-01 Thread Ype Kingma
On Monday 01 December 2003 05:38, Ralph wrote:
> Hi,
>
> does somebody have an example of how to use another similarity class
> implementation for searching? Assuming I have implemented MySimilarity
>
> class MySimilarity implements Similarity{ 
>
> how do I have to plug it in to actually use it for a search in a way that I
> have to program as little as possible :-) ?

Try inheriting from DefaultSimilarity.

Have fun,
Ype





Re: Real Boolean Model in Lucene?

2003-12-01 Thread Ype Kingma
Ralph,

On Monday 01 December 2003 04:11, [EMAIL PROTECTED] wrote:
> Hi,
>
> is it possible to use a real boolean model in lucene for searching. When
> one is using the Queryparser with a boolean query (i.e. "dog AND horse")
> one does get a list of documents from the Hits object. However these
> documents have a ranking (score).
>
> My Question: Does Lucene use TF/IDF for getting this? (which would mean it
> does not use the boolean model for the boolean query...)
>
> How can one use a boolean model search, where the outcome are all score=1 ?
> Example?

You can use the low level scoring API, which simply enumerates all boolean
hits. It also gives you the score, but you can just ignore that if you want.

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Searcher.html
Use the search() method with a HitCollector, and provide your own HitCollector.

When you do this, avoid retrieving documents during the search; retrieve docs
afterwards. Retrieving docs during search would cause unwanted disk head seeks
from the terms index to the stored fields and back.
It is also preferable to retrieve docs in the order that you collect them, i.e.
independent of the score.
Even so, retrieving documents normally takes more time than collecting them.
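
The collect-first, retrieve-later pattern can be sketched without Lucene on
the classpath as follows. BooleanHits is a stand-in class, not part of
Lucene; its collect(int, float) method only mirrors the shape of a
hit-collector callback, and the loader function stands in for whatever
fetches a stored document by number. The sketch assumes that retrieving in
increasing document-number order is acceptable.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.function.IntFunction;

class BooleanHits {
    private final BitSet matches = new BitSet();

    /** Called once per matching document; the score is deliberately ignored. */
    public void collect(int doc, float score) {
        matches.set(doc);
    }

    /** Number of documents that matched, each counting the same regardless of score. */
    public int count() {
        return matches.cardinality();
    }

    /** After the search has finished, fetch documents in increasing document-number order. */
    public <T> List<T> retrieve(IntFunction<T> loader) {
        List<T> docs = new ArrayList<>();
        for (int doc = matches.nextSetBit(0); doc >= 0; doc = matches.nextSetBit(doc + 1)) {
            docs.add(loader.apply(doc));
        }
        return docs;
    }

    public static void main(String[] args) {
        BooleanHits hits = new BooleanHits();
        hits.collect(7, 0.3f);
        hits.collect(2, 0.9f);
        System.out.println(hits.retrieve(doc -> "doc" + doc));
    }
}
```

Nothing is loaded while collect() is being called, so the search phase
touches only the term data and the stored fields are read in one pass
afterwards.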

Kind regards,
Ype





Re: Help with Searching indexes from a web app (Lucene 1.3 rc2)

2003-12-01 Thread Dror Matalon
Please, always include the full exception when reporting a problem.

As for your problem, I would guess that it has to do with where Lucene
puts its temp files. I think that something changed between 1.2 and 1.3
and on Unix it started using /tmp instead of the index directory for the
tmp files. So somehow from your webapp you don't have permissions to
write /tmp. Are you running a security manager, by any chance?

Regards,

Dror

On Mon, Dec 01, 2003 at 04:21:04PM -, Iain Young wrote:
> Note, that I've just tried the example webapp supplied with Lucene, and I
> appear to be having exactly the same problem with that. The 1.2 version
> works ok, but the 1.3 version is displaying a path not found error.
> 
> Are there any known incompatibilities with certain versions of Tomcat (I'm
> currently using version 4.0.3)
> 
> Thanks,
> Iain
> 
> -----Original Message-----
> From: Iain Young [mailto:[EMAIL PROTECTED]
> Sent: 01 December 2003 15:40
> To: '[EMAIL PROTECTED]'
> Subject: Help with Searching indexes from a web app (Lucene 1.3 rc2)
> 
> 
> > Hi folks.
> > 
> > I'm new to Lucene so this may be an obvious question, but I am having
> > problems with Lucene 1.3-rc2. I've got a bit of code which looks something
> > like this
> > 
> > public static void getSearchResults(String searchString, String indexDir)
> > {
> > try
> > {
> > Searcher searcher = new IndexSearcher(indexDir);
> > .
> > etc...
> > .
> > }
> > catch (Exception ex)
> > {
> > }
> > }
> > 
> > I'm calling it from a web application (servlet) running in tomcat in
> > conjunction with struts and velocity. If I use the Lucene 1.2 binary
> > release, it all works fine and I get the search results ok. However, when
> > I replace the 1.2 jar file with the 1.3-rc2 jar file (leaving all of my
> > code exactly the same) it stops working, and I get a path not found
> > exception being thrown.
> > 
> > I've narrowed it down to the IndexReader.open(final Directory directory)
> > method. Even if I pass a valid Directory object into this (created by
> > FSDirectory), it just seems to throw the exception, (even though I know
> > the directory object is not null etc). The bizarre thing is that this
> > problem only seems to occur when I run it from the web application. If I
> > invoke the same code from the command line, it works ok, (even though I'm
> > using the same string for the index dir).
> > 
> > Anyone got any ideas? (I want to use 1.3 because I want to exploit some of
> > the newer features). Does running from within a web application do
> > something strange with the paths, even though the strings I'm using are
> > fully qualified?
> > 
> > Thanks for your help,
> > 
> > Iain Young
> > http://www.microfocus.com

-- 
Dror Matalon
Zapatec Inc 
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Dates and others

2003-12-01 Thread Chong, Herb
Ad hoc techniques run into lots of trouble because the requirement on Lucene
isn't well specified. Is a document that matches only one of the search terms,
but is a week newer, new enough to move ahead of a document that matches all
of the search terms? The boost mechanism is a way to move documents around in
the ranking list, but it is clearly a way to reweight the importance of the
query terms and not a way to impose external constraints, which properly
should be handled outside the search engine.

Herb...

-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Monday, December 01, 2003 1:11 PM
To: Lucene Users List
Subject: Re: Dates and others

The problem with this approach is that eventually you'll exhaust the 
range of the boost.  So this will only work if you re-index things from 
scratch periodically, with a boost of something like 1/days-ago.

If you're adding documents to the index in date order, then you could 
use a HitCollector which adjusts scores according to the document 
number, since document numbers increase as you add to the index.

If you're not adding things in date order, then you can, when you open 
the index, build an array mapping document numbers to integer dates. 
Then your hit collector can use this to either boost or sort hits by date.

Or you could add a "month" or "week" field to documents, then add it as 
a clause to your queries with a boost.  Then documents matching the most 
recent week(s) and/or month(s) would get the boost.

Doug




Re: Dates and others

2003-12-01 Thread Doug Cutting
Dion Almaer wrote:
> The only real item that I still want to tweak more is getting recent
> results higher in the list.
>
> I was wondering if something like this could work (or if there is a
> better solution)
>
> At index time, I have the date of the content.  I could do some math
> where the higher the date (based on the time_t version or whatever)
> the more of a setBoost(metric). Or, for every month in the past,
> create a larger negative number to setBoost()... or something like that.
>
> Would something like this make sense?

The problem with this approach is that eventually you'll exhaust the
range of the boost.  So this will only work if you re-index things from
scratch periodically, with a boost of something like 1/days-ago.
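
A minimal sketch of the 1/days-ago idea. The class name is hypothetical,
and clamping the age to at least one day is an assumption added here to
avoid dividing by zero for documents dated on the indexing day. The result
is the value that would be passed to setBoost() during a full reindex.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

class RecencyBoost {
    /** Boost of 1/days-ago, with the age clamped to a minimum of one day. */
    public static float boost(LocalDate docDate, LocalDate indexDate) {
        long daysAgo = ChronoUnit.DAYS.between(docDate, indexDate);
        return 1.0f / Math.max(1L, daysAgo);
    }

    public static void main(String[] args) {
        LocalDate indexDate = LocalDate.of(2003, 12, 1);
        System.out.println(boost(LocalDate.of(2003, 11, 30), indexDate));
        System.out.println(boost(LocalDate.of(2003, 11, 21), indexDate));
    }
}
```

Because the value is relative to the indexing date, it goes stale as soon
as the index is older than a day, which is why a periodic full reindex is
needed.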

If you're adding documents to the index in date order, then you could 
use a HitCollector which adjusts scores according to the document 
number, since document numbers increase as you add to the index.
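
One way to picture the document-number adjustment, as a sketch that does
not depend on Lucene. The class below is a stand-in whose collect(int,
float) method mirrors the shape of a hit-collector callback; the linear
formula and the recencyWeight parameter are assumptions for illustration,
not something prescribed in this thread.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class DocNumberRecencyCollector {
    /** A collected hit with its adjusted score. */
    static final class Hit {
        public final int doc;
        public final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    private final int maxDoc;
    private final float recencyWeight;
    private final List<Hit> hits = new ArrayList<>();

    /** maxDoc is one more than the largest document number and must be positive. */
    DocNumberRecencyCollector(int maxDoc, float recencyWeight) {
        this.maxDoc = maxDoc;
        this.recencyWeight = recencyWeight;
    }

    /** Scales the raw score up for higher (more recently added) document numbers. */
    public void collect(int doc, float score) {
        float recency = (float) doc / maxDoc; // 0.0 = oldest, just under 1.0 = newest
        hits.add(new Hit(doc, score * (1.0f + recencyWeight * recency)));
    }

    /** Hits ordered by adjusted score, best first. */
    public List<Hit> ranked() {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return sorted;
    }
}
```

For example, with maxDoc 100 and a weight of 1.0, a hit on document 90
with raw score 0.8 ends up ahead of a hit on document 10 with raw score 1.0.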

If you're not adding things in date order, then you can, when you open 
the index, build an array mapping document numbers to integer dates. 
Then your hit collector can use this to either boost or sort hits by date.
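
A sketch of the sorting half of that idea, assuming the array has already
been filled when the index was opened (how the dates are read out of the
index is left out). DateSortedHits is a hypothetical helper, and dates are
assumed to be stored as integers such as 20031201 so that plain integer
comparison orders them.

```java
import java.util.Arrays;
import java.util.Comparator;

class DateSortedHits {
    /**
     * Orders matching document numbers newest first.
     * dates[doc] holds the integer date of document number doc.
     */
    public static Integer[] newestFirst(int[] matchingDocs, int[] dates) {
        Integer[] docs = Arrays.stream(matchingDocs).boxed().toArray(Integer[]::new);
        Arrays.sort(docs, Comparator.comparingInt((Integer d) -> dates[d]).reversed());
        return docs;
    }

    public static void main(String[] args) {
        int[] dates = {20031101, 20031201, 20030915};
        System.out.println(Arrays.toString(newestFirst(new int[]{0, 1, 2}, dates)));
    }
}
```

The same array could equally feed a score adjustment inside the collector
instead of a pure sort.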

Or you could add a "month" or "week" field to documents, then add it as 
a clause to your queries with a boost.  Then documents matching the most 
recent week(s) and/or month(s) would get the boost.
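
The extra clause could be generated along these lines. This is a sketch
under assumptions: a "month" field indexed as yyyyMM, a boost that halves,
thirds, and so on for each month back, and a helper class that is
hypothetical. The user's query is made required with + so that the month
clauses only influence ranking, not which documents match.

```java
import java.time.YearMonth;
import java.time.format.DateTimeFormatter;

class RecentMonthClause {
    private static final DateTimeFormatter MONTH_FORMAT = DateTimeFormatter.ofPattern("uuuuMM");

    /** Wraps the query as required and appends optional, boosted month clauses. */
    public static String withRecencyClauses(String query, YearMonth current,
                                            int monthsBack, float topBoost) {
        StringBuilder sb = new StringBuilder("+(").append(query).append(')');
        for (int i = 0; i < monthsBack; i++) {
            float boost = topBoost / (i + 1); // most recent month gets the full boost
            sb.append(" month:")
              .append(current.minusMonths(i).format(MONTH_FORMAT))
              .append('^')
              .append(boost);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(withRecencyClauses("lucene", YearMonth.of(2003, 12), 2, 4.0f));
    }
}
```

Unlike an index-time boost, this keeps working without a reindex, since the
set of "recent" months is decided when the query is built.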

Doug



RE: Help with Searching indexes from a web app (Lucene 1.3 rc2)

2003-12-01 Thread Iain Young
Note, that I've just tried the example webapp supplied with Lucene, and I
appear to be having exactly the same problem with that. The 1.2 version
works ok, but the 1.3 version is displaying a path not found error.

Are there any known incompatibilities with certain versions of Tomcat (I'm
currently using version 4.0.3)

Thanks,
Iain

-----Original Message-----
From: Iain Young [mailto:[EMAIL PROTECTED]
Sent: 01 December 2003 15:40
To: '[EMAIL PROTECTED]'
Subject: Help with Searching indexes from a web app (Lucene 1.3 rc2)


> Hi folks.
> 
> I'm new to Lucene so this may be an obvious question, but I am having
> problems with Lucene 1.3-rc2. I've got a bit of code which looks something
> like this
> 
> public static void getSearchResults(String searchString, String indexDir)
> {
> try
> {
> Searcher searcher = new IndexSearcher(indexDir);
> .
> etc...
> .
> }
> catch (Exception ex)
> {
> }
> }
> 
> I'm calling it from a web application (servlet) running in tomcat in
> conjunction with struts and velocity. If I use the Lucene 1.2 binary
> release, it all works fine and I get the search results ok. However, when
> I replace the 1.2 jar file with the 1.3-rc2 jar file (leaving all of my
> code exactly the same) it stops working, and I get a path not found
> exception being thrown.
> 
> I've narrowed it down to the IndexReader.open(final Directory directory)
> method. Even if I pass a valid Directory object into this (created by
> FSDirectory), it just seems to throw the exception, (even though I know
> the directory object is not null etc). The bizarre thing is that this
> problem only seems to occur when I run it from the web application. If I
> invoke the same code from the command line, it works ok, (even though I'm
> using the same string for the index dir).
> 
> Anyone got any ideas? (I want to use 1.3 because I want to exploit some of
> the newer features). Does running from within a web application do
> something strange with the paths, even though the strings I'm using are
> fully qualified?
> 
> Thanks for your help,
> 
> Iain Young
> http://www.microfocus.com
> 




This e-mail has been scanned for viruses by MCI's Internet Managed Scanning
Services - powered by MessageLabs. For further information visit
http://www.mci.com






RE: WebLucene 0.3 release:support CJK, use sax based indexing, docID based result sorting and xml format output with highlighting support.

2003-12-01 Thread Tun Lin
Hi Che Dong,

In the install.txt included in the package, could you add Windows setup
instructions to the part on preparing the environment? I think what you
wrote in install.txt is for a UNIX setup. I still cannot get my system
working. Please help.

Thanks. 

-Original Message-
From: Che Dong [mailto:[EMAIL PROTECTED] 
Sent: Monday, December 01, 2003 4:21 PM
To: Lucene Users List; [EMAIL PROTECTED]
Subject: Re: WebLucene 0.3 release:support CJK, use sax based indexing, docID
based result sorting and xml format output with highlighting support.

build..properties.default 

# -
# WebLucene  BUILD  PROPERTIES
# -
jsdk_jar=/usr/local/resin/lib/jsdk23.jar

# Home directory of JavaCC
javacc.home = /usr/java/javacc/bin

# modify following on Windows
# jsdk_jar=c:\\resin\\lib\\jsdk23.jar
# javacc.home = c:\\java\\javacc\\bin


javacc.zip.dir = ${javacc.home}/lib
javacc.zip = ${javacc.zip.dir}/JavaCC.zip

Che, Dong
- Original Message -
From: "Tun Lin" <[EMAIL PROTECTED]>
To: "'Lucene Developers List'" <[EMAIL PROTECTED]>; "'Lucene Users
List'" <[EMAIL PROTECTED]>
Sent: Monday, December 01, 2003 11:34 AM
Subject: RE: WebLucene 0.3 release:support CJK, use sax based indexing, docID
based result sorting and xml format output with highlighting support.


> Hi,
> 
> Do you have the install.txt for windows XP setup of the WebLucene? It seems that
> the install.txt is only for UNIX setup.
> 
> Thanks.  
> 
> -Original Message-
> From: Che Dong [mailto:[EMAIL PROTECTED] 
> Sent: Sunday, November 30, 2003 9:57 PM
> To: Lucene Developers List; Lucene Users List
> Subject: WebLucene 0.3 release:support CJK, use sax based indexing, docID based
> result sorting and xml format output with highlighting support.
> 
> http://sourceforge.net/projects/weblucene/
> 
> WebLucene: 
> A Lucene search engine XML interface, providing SAX-based indexing, result
> sorting by indexing sequence, and XML output with highlighting support. The
> CJKTokenizer supports Chinese, Japanese and Korean alongside Western
> languages.
> 
> The key features:
> 1 The bi-gram based CJK support: org/apache/lucene/analysis/cjk/CJKTokenizer
> 
> 2 docID based result sorting: org/apache/lucene/search/IndexOrderSearcher
> 
> 3 xml output: com/chedong/weblucene/search/DOMSearcher
> 
> 4 sax based indexing: com/chedong/weblucene/index/SAXIndexer
> 
> 5 token based highlighter: 
> reverse StopTokenzier:
> org/apache/lucene/anlysis/HighlightAnalyzer.java
>   HighlightFilter.java
> with abstract:
> com/chedong/weblucene/search/WebluceneHighlighter
> 
> 6 A simplified query parser:
> google like syntax with term limit
> org/apache/lucene/queryParser/SimpleQueryParser
> modified from early version of Lucene :)
> 
> Regards
> 
> Che, Dong
> 
> 
> 
> 
> 






Help with Searching indexes from a web app (Lucene 1.3 rc2)

2003-12-01 Thread Iain Young
> Hi folks.
> 
> I'm new to Lucene so this may be an obvious question, but I am having
> problems with Lucene 1.3-rc2. I've got a bit of code which looks something
> like this:
> 
> public static void getSearchResults(String searchString, String indexDir)
> {
>     try
>     {
>         Searcher searcher = new IndexSearcher(indexDir);
>         .
>         etc...
>         .
>     }
>     catch (Exception ex)
>     {
>     }
> }
> 
> I'm calling it from a web application (servlet) running in Tomcat in
> conjunction with Struts and Velocity. If I use the Lucene 1.2 binary
> release, it all works fine and I get the search results OK. However, when
> I replace the 1.2 jar file with the 1.3-rc2 jar file (leaving all of my
> code exactly the same), it stops working and a path-not-found exception
> is thrown.
> 
> I've narrowed it down to the IndexReader.open(final Directory directory)
> method. Even if I pass a valid Directory object into this (created by
> FSDirectory), it still throws the exception, even though I know the
> directory object is not null. The bizarre thing is that this problem
> only seems to occur when I run it from the web application. If I invoke
> the same code from the command line, it works OK, even though I'm using
> the same string for the index dir.
> 
> Anyone got any ideas? (I want to use 1.3 because I want to exploit some of
> the newer features.) Does running from within a web application do
> something strange with the paths, even though the strings I'm using are
> fully qualified?
> 
> Thanks for your help,
> 
> Iain Young
> http://www.microfocus.com
> 
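
One way to narrow this kind of problem down is to stop swallowing the exception (the empty catch block in the snippet above hides the message and stack trace) and to check how the JVM inside the servlet container actually sees the directories involved. The following is a small diagnostic sketch using only java.io.File; the class and method names are made up for illustration. It may also be worth running the same check on System.getProperty("java.io.tmpdir"), on the assumption (not confirmed in this thread) that a container can point the temp directory at a location that was never created.

```java
import java.io.File;
import java.io.IOException;

/** Hypothetical diagnostic helper; not part of Lucene. */
final class IndexPathCheck {
    /** Returns null if the directory looks usable, otherwise a short description of the problem. */
    static String diagnose(String dirName) {
        File dir = new File(dirName);
        String resolved;
        try {
            resolved = dir.getCanonicalPath();   // what the path really resolves to in this JVM
        } catch (IOException e) {
            resolved = dir.getAbsolutePath();    // fall back if canonicalization fails
        }
        if (!dir.exists()) {
            String hint = dir.isAbsolute()
                    ? ""
                    : " (relative path, resolved against user.dir=" + System.getProperty("user.dir") + ")";
            return "does not exist: " + resolved + hint;
        }
        if (!dir.isDirectory()) {
            return "not a directory: " + resolved;
        }
        if (!dir.canRead()) {
            return "not readable by this process: " + resolved;
        }
        return null;
    }
}
```

Logging the result of diagnose(indexDir) from the servlet and from the command line, together with the full stack trace from the catch block, should show whether the two environments really resolve the same path.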




Re: AW: AW: Real Boolean Model in Lucene?

2003-12-01 Thread ambiesense
Hello Karsten,

That is fine with me. An implementation can never be matched 100% to a
theory, as the ISO OSI model has shown perfectly. :-) That's OK with me, and
I want to thank you again for the clarification I gained from this conversation.

Cheers

> 
> Hello Ralf,
> 
> >>
> According to your description, Lucene basically maps the boolean query
> into the vector space and measures the cosine similarity towards other
> documents in the vector space. If I understood you correctly, you mean
> that if a document is found by Lucene based on a boolean query, it is
> relevant (boolean true). If it is not returned, it was boolean false. The
> score sits on top of that and can be used for ranking. If I wanted to use
> a true boolean model, I would therefore just need to ignore the score of
> the Hits document. Did I understand correctly?
> >>
> 
> Yes, I think that this is indeed pretty close to some theoretical
> foundation: the Boolean Model explains which documents fit a query, while
> some appropriate (Lucene is good!) similarity function in vector space
> yields the ranking.
> 
> Now hell would be the place for me where I would have to prove that
> Lucene's ranking is exactly equivalent to some transformation of vector
> space and then using the *cosine* for the ranking. It can't be, really, as
> Lucene sometimes returns results > 1.0 and only some ruthless
> normalisation keeps it within 0.0 to 1.0. In other words, there are still
> some rough corners in Lucene where a good theorist could find some work.
> 
> Could we leave this topic aside until some suicid.. err, I mean
> enthusiastic fellow tries to work out a really good theory?
> 
> Regards,
> 
> Karsten
> 
> 
> 
> 
> 
> -Original Message-
> From: Ralf B [mailto:[EMAIL PROTECTED] 
> Sent: Monday, 1 December 2003 14:28
> To: Lucene Users List
> Subject: Re: AW: Real Boolean Model in Lucene?
> 
> 
> Hi Karsten,
> 
> I want to thank you for your qualified answer as well as your answer
> from the 14th of November, where you agreed with me that Lucene is
> basically a VSM implementation. Sometimes it is difficult to make the
> link between the clear theory and its implementation.
> 
> According to your description, Lucene basically maps the boolean query
> into the vector space and measures the cosine similarity towards other
> documents in the vector space. If I understood you correctly, you mean
> that if a document is found by Lucene based on a boolean query, it is
> relevant (boolean true). If it is not returned, it was boolean false. The
> score sits on top of that and can be used for ranking. If I wanted to use
> a true boolean model, I would therefore just need to ignore the score of
> the Hits document. Did I understand correctly?
> 
> I agree that nobody really wants to do that. My question was intended to
> find out more about the theory implemented within Lucene.
> 
> Cheers,
> Ralph
> 
> 
> > 
> > Hi,
> > 
> > >>
> > My question: does Lucene use TF/IDF for getting this? (That would mean
> > it does not use the boolean model for the boolean query...)
> > >>
> > 
> > Lucene indeed uses TF/IDF with length normalization for fields and
> > documents.
> > 
> > However, Lucene is "downward compatible" with the Boolean Model, where
> > documents are represented as 0/1-vectors in vector space. Ranking just
> > adds weights to the elements of the result set, so the underlying
> > interpretation of a query result can still be that of a
> > Propositional/Boolean model. If a document appears in the result, its
> > tokens make the query (which actually is a propositional formula
> > formed over words and phrases) evaluate to true. The representation of
> > documents is more complex in Lucene than required for the Boolean
> > Model, and as a result, Lucene can efficiently handle phrases and
> > proximity searches, but these seem to be compatible extensions - if
> > you can do it in the Boolean Model, you can do it in Lucene :)
> > 
> > One place where Lucene is not 100% compatible with a basic Boolean
> > Model is that full negation is a bit tricky - you cannot simply ask for
> > all documents that do not contain a certain term unless you also have
> > some term that appears in all documents. Not a big deal, really.
> > 
> > If TF/IDF weighting is a problem for you, a custom Similarity
> > implementation allows you to remove all references to length
> > normalization and document frequencies.
> > 
> > Regards,
> > 
> > Kind regards from Saarbrücken
> > 
> > --
> > 
> > Dr.-Ing. Karsten Konrad
> > Head of Artificial Intelligence Lab
> > 
> > XtraMind Technologies GmbH
> > Stuhlsatzenhausweg 3
> > D-66123 Saarbrücken
> > Phone: +49 (681) 3025113
> > Fax: +49 (681) 3025109
> > [EMAIL PROTECTED]
> > www.xtramind.com
> > 
> > 
> > 
> > -Original Message-
> > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> > Sent: Monday, 1 December 2003 13:11
> > To

AW: AW: Real Boolean Model in Lucene?

2003-12-01 Thread Karsten Konrad

Hello Ralf,

>>
According to your description, Lucene basically maps the boolean query into the vector
space and measures the cosine similarity towards other documents in the vector space.
If I understood you correctly, you mean that if a document is found by Lucene based on
a boolean query, it is relevant (boolean true). If it is not returned, it was boolean
false. The score sits on top of that and can be used for ranking. If I wanted to use a
true boolean model, I would therefore just need to ignore the score of the Hits
document. Did I understand correctly?
>>

Yes, I think that this is indeed pretty close to some theoretical foundation: the
Boolean Model explains which documents fit a query, while some appropriate (Lucene is
good!) similarity function in vector space yields the ranking.

Now hell would be the place for me where I would have to prove that Lucene's ranking
is exactly equivalent to some transformation of vector space and then using the
*cosine* for the ranking. It can't be, really, as Lucene sometimes returns results > 1.0
and only some ruthless normalisation keeps it within 0.0 to 1.0. In other words, there
are still some rough corners in Lucene where a good theorist could find some work.
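
To make the "ruthless normalisation" concrete, here is a small sketch of one plausible form of it: if the best raw score exceeds 1.0, every score is divided by the best one. My understanding is that the Hits class rescales in roughly this way, but that is an assumption and should be checked against the source of the version in use; the class and method names below are invented for the example.

```java
/** Illustrative only: rescales raw scores so that the top hit is at most 1.0. */
final class ScoreNormSketch {
    static float[] normalize(float[] rawScores) {
        float top = 0f;
        for (float s : rawScores) {
            top = Math.max(top, s);
        }
        // Only rescale when the best score overshoots 1.0; otherwise leave scores untouched.
        float norm = top > 1.0f ? 1.0f / top : 1.0f;
        float[] out = new float[rawScores.length];
        for (int i = 0; i < rawScores.length; i++) {
            out[i] = rawScores[i] * norm;
        }
        return out;
    }
}
```

A rescaling like this preserves the ranking within one result list, but the factor depends on the query's own top hit, so the resulting numbers are not comparable across queries and cannot be read as cosines.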

Could we leave this topic aside until some suicid.. err, I mean enthusiastic fellow
tries to work out a really good theory?

Regards,

Karsten





-Original Message-
From: Ralf B [mailto:[EMAIL PROTECTED] 
Sent: Monday, 1 December 2003 14:28
To: Lucene Users List
Subject: Re: AW: Real Boolean Model in Lucene?


Hi Karsten,

I want to thank you for your qualified answer as well as your answer from the 14th of
November, where you agreed with me that Lucene is basically a VSM implementation.
Sometimes it is difficult to make the link between the clear theory and its
implementation.

According to your description, Lucene basically maps the boolean query into the vector
space and measures the cosine similarity towards other documents in the vector space.
If I understood you correctly, you mean that if a document is found by Lucene based on
a boolean query, it is relevant (boolean true). If it is not returned, it was boolean
false. The score sits on top of that and can be used for ranking. If I wanted to use a
true boolean model, I would therefore just need to ignore the score of the Hits
document. Did I understand correctly?

I agree that nobody really wants to do that. My question was intended to find out more
about the theory implemented within Lucene.

Cheers,
Ralph


> 
> Hi,
> 
> >>
> My question: does Lucene use TF/IDF for getting this? (That would mean
> it does not use the boolean model for the boolean query...)
> >>
> 
> Lucene indeed uses TF/IDF with length normalization for fields and
> documents.
> 
> However, Lucene is "downward compatible" with the Boolean Model, where
> documents are represented as 0/1-vectors in vector space. Ranking just
> adds weights to the elements of the result set, so the underlying
> interpretation of a query result can still be that of a
> Propositional/Boolean model. If a document appears in the result, its
> tokens make the query (which actually is a propositional formula
> formed over words and phrases) evaluate to true. The representation of
> documents is more complex in Lucene than required for the Boolean
> Model, and as a result, Lucene can efficiently handle phrases and
> proximity searches, but these seem to be compatible extensions - if
> you can do it in the Boolean Model, you can do it in Lucene :)
> 
> One place where Lucene is not 100% compatible with a basic Boolean
> Model is that full negation is a bit tricky - you cannot simply ask for
> all documents that do not contain a certain term unless you also have
> some term that appears in all documents. Not a big deal, really.
> 
> If TF/IDF weighting is a problem for you, a custom Similarity
> implementation allows you to remove all references to length
> normalization and document frequencies.
> 
> Regards,
> 
> Kind regards from Saarbrücken
> 
> --
> 
> Dr.-Ing. Karsten Konrad
> Head of Artificial Intelligence Lab
> 
> XtraMind Technologies GmbH
> Stuhlsatzenhausweg 3
> D-66123 Saarbrücken
> Phone: +49 (681) 3025113
> Fax: +49 (681) 3025109
> [EMAIL PROTECTED]
> www.xtramind.com
> 
> 
> 
> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> Sent: Monday, 1 December 2003 13:11
> To: [EMAIL PROTECTED]
> Subject: Real Boolean Model in Lucene?
> 
> 
> Hi,
> 
> Is it possible to use a real boolean model in Lucene for searching? When
> one uses the QueryParser with a boolean query (e.g. "dog AND horse"),
> one gets a list of documents from the Hits object. However, these
> documents have a ranking (score).
> 
> My question: does Lucene use TF/IDF for getting this? (That would mean
> it does not use the boolean model for the boolean query...)
> 
> How can one use a boolean model search

Re: New Lucene-powered Website

2003-12-01 Thread Ulrich Mayring
Chong, Herb wrote:
> can you share a description of the heuristics you used to clean up the
> text? i am facing the same problem right now handling email. i'm not
> interested in the rules you use as much as the tools you use to implement
> the rules.

The tools... well, Java ;-)

The search engine is a custom Java application which uses Lucene. The
heuristics are not very general at this point; they are tailored to our
domain. So what you are hinting at (a generic rules description language
that can be customized to the local domain) seems appropriate. Our rules are
things like "anything within ... is an important sentence and
we add a full stop at the end".

Ulrich





RE: New Lucene-powered Website

2003-12-01 Thread Chong, Herb
can you share a description of the heuristics you used to clean up the text? i am 
facing the same problem right now handling email. i'm not interested in the rules you 
use as much as the tools you use to implement the rules.

Herb

-Original Message-
From: Ulrich Mayring [mailto:[EMAIL PROTECTED]
Sent: Friday, November 28, 2003 4:21 AM
To: [EMAIL PROTECTED]
Subject: Re: New Lucene-powered Website

This "clean-up work" is actually trickier than the summarising itself 
and it is usually very domain-specific. That's the reason why I haven't 
proposed to contribute the summariser to Lucene, because the clean-up 
code is not generic. The summariser itself is just one class with 300 
lines, but without prior clean-up the quality of its summaries is 
insufficient.




How to change similarity measure...

2003-12-01 Thread Ralph
Hi,

Does somebody have an example of how to use another similarity class
implementation for searching? Assuming I have implemented MySimilarity

class MySimilarity implements Similarity{ 

how do I plug it in to actually use it for a search, in a way that requires
as little programming as possible :-) ?

Kind Regards,
Ralph
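
A hedged sketch of the idea, written against stand-in types so that it runs without Lucene on the classpath. SimilaritySketch and its subclasses are NOT Lucene classes; they only mirror the kind of hooks (tf, idf, length normalization) a similarity implementation overrides, and the score formula is deliberately simplified. As far as I recall, in Lucene 1.3 Similarity is an abstract class rather than an interface, so a real implementation would use "extends" (typically extending DefaultSimilarity) and would be plugged in with a call such as searcher.setSimilarity(new MySimilarity()); please verify both points against the Javadoc of the version in use.

```java
/** Stand-in for the scoring hooks discussed in this thread; not Lucene's real class. */
abstract class SimilaritySketch {
    abstract float tf(float freq);                 // weight from term frequency in the document
    abstract float idf(int docFreq, int numDocs);  // weight from rarity of the term
    abstract float lengthNorm(int numTokens);      // penalty for long fields

    /** Simplified single-term score: the product of the three factors. */
    float score(float freq, int docFreq, int numDocs, int numTokens) {
        return tf(freq) * idf(docFreq, numDocs) * lengthNorm(numTokens);
    }
}

/** TF/IDF-style weighting with length normalization. */
final class TfIdfSketch extends SimilaritySketch {
    float tf(float freq) { return (float) Math.sqrt(freq); }
    float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }
    float lengthNorm(int numTokens) { return (float) (1.0 / Math.sqrt(numTokens)); }
}

/** "Boolean" weighting: every matching document scores the same. */
final class FlatSketch extends SimilaritySketch {
    float tf(float freq) { return freq > 0 ? 1f : 0f; }
    float idf(int docFreq, int numDocs) { return 1f; }
    float lengthNorm(int numTokens) { return 1f; }
}
```

The design point is that the searcher only talks to the abstract type, so swapping TfIdfSketch for FlatSketch changes the ranking without touching the query code. With the flat variant every match gets the same score, which is close to the all-scores-equal behaviour asked about in the Boolean Model thread.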




Re: AW: Real Boolean Model in Lucene?

2003-12-01 Thread Ralf B
Hi Karsten,

I want to thank you for your qualified answer as well as your answer from
the 14th of November, where you agreed with me that Lucene is basically a
vector space model (VSM) implementation. Sometimes it is difficult to make
the link between the clear theory and its implementation.

According to your description, Lucene basically maps the boolean query into
the vector space and measures the cosine similarity towards other documents
in the vector space. If I understood you correctly, you mean that if a
document is found by Lucene based on a boolean query, it is relevant (boolean
true). If it is not returned, it was boolean false. The score sits on top of
that and can be used for ranking. If I wanted to use a true boolean model, I
would therefore just need to ignore the score of the Hits document. Did I
understand correctly?

I agree that nobody really wants to do that. My question was intended to find
out more about the theory implemented within Lucene.

Cheers,
Ralph


> 
> Hi,
> 
> >>
> My question: does Lucene use TF/IDF for getting this? (That would mean
> it does not use the boolean model for the boolean query...)
> >>
> 
> Lucene indeed uses TF/IDF with length normalization for fields and
> documents.
> 
> However, Lucene is "downward compatible" with the Boolean Model, where
> documents are represented as 0/1-vectors in vector space. Ranking just
> adds weights to the elements of the result set, so the underlying
> interpretation of a query result can still be that of a
> Propositional/Boolean model. If a document appears in the result, its
> tokens make the query (which actually is a propositional formula
> formed over words and phrases) evaluate to true. The representation of
> documents is more complex in Lucene than required for the Boolean
> Model, and as a result, Lucene can efficiently handle phrases and
> proximity searches, but these seem to be compatible extensions - if
> you can do it in the Boolean Model, you can do it in Lucene :)
> 
> One place where Lucene is not 100% compatible with a basic Boolean
> Model is that full negation is a bit tricky - you cannot simply ask for
> all documents that do not contain a certain term unless you also have
> some term that appears in all documents. Not a big deal, really.
> 
> If TF/IDF weighting is a problem for you, a custom Similarity
> implementation allows you to remove all references to length
> normalization and document frequencies.
> 
> Regards,
> 
> Kind regards from Saarbrücken
> 
> --
> 
> Dr.-Ing. Karsten Konrad
> Head of Artificial Intelligence Lab
> 
> XtraMind Technologies GmbH
> Stuhlsatzenhausweg 3
> D-66123 Saarbrücken
> Phone: +49 (681) 3025113
> Fax: +49 (681) 3025109
> [EMAIL PROTECTED]
> www.xtramind.com
> 
> 
> 
> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
> Sent: Monday, 1 December 2003 13:11
> To: [EMAIL PROTECTED]
> Subject: Real Boolean Model in Lucene?
> 
> 
> Hi,
> 
> Is it possible to use a real boolean model in Lucene for searching? When
> one uses the QueryParser with a boolean query (e.g. "dog AND horse"),
> one gets a list of documents from the Hits object. However, these
> documents have a ranking (score).
> 
> My question: does Lucene use TF/IDF for getting this? (That would mean
> it does not use the boolean model for the boolean query...)
> 
> How can one run a boolean-model search in which every hit has
> score=1? Is there an example?
> 
> Cheers,
> Ralph
> 
> 




AW: Real Boolean Model in Lucene?

2003-12-01 Thread Karsten Konrad

Hi,

>>
My question: does Lucene use TF/IDF for getting this? (That would mean it does not
use the boolean model for the boolean query...)
>>

Lucene indeed uses TF/IDF with length normalization for fields and documents.

However, Lucene is "downward compatible" with the Boolean Model, where
documents are represented as 0/1-vectors in vector space. Ranking just
adds weights to the elements of the result set, so the underlying
interpretation of a query result can still be that of a
Propositional/Boolean model. If a document appears in the result,
its tokens make the query (which actually is a propositional
formula formed over words and phrases) evaluate to true. The representation
of documents is more complex in Lucene than required for the Boolean
Model, and as a result, Lucene can efficiently handle phrases and
proximity searches, but these seem to be compatible extensions -
if you can do it in the Boolean Model, you can do it in Lucene :)

One place where Lucene is not 100% compatible with a basic Boolean Model is that
full negation is a bit tricky - you cannot simply ask for all documents that
do not contain a certain term unless you also have some term that appears in all
documents. Not a big deal, really.
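
The two points above (result-set membership is the Boolean answer, and pure negation needs the set of all documents) can be illustrated with a toy inverted index. This is a plain-Java sketch with invented names, not Lucene code; each term maps to a BitSet of the document numbers that contain it.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/** Toy Boolean-model index: membership in the returned BitSet is the whole answer, with no scores. */
final class BooleanModelSketch {
    private final Map<String, BitSet> postings = new HashMap<>();
    private int numDocs = 0;

    /** Adds a document consisting of the given terms and returns its document number. */
    int add(String... terms) {
        int id = numDocs++;
        for (String t : terms) {
            postings.computeIfAbsent(t, k -> new BitSet()).set(id);
        }
        return id;
    }

    /** Documents containing the term (a copy, so callers may modify it). */
    BitSet docs(String term) {
        BitSet b = postings.get(term);
        return b == null ? new BitSet() : (BitSet) b.clone();
    }

    /** a AND b */
    BitSet and(BitSet a, BitSet b) {
        BitSet r = (BitSet) a.clone();
        r.and(b);
        return r;
    }

    /** Pure NOT: only possible because we know the universe 0..numDocs-1. */
    BitSet not(BitSet a) {
        BitSet r = new BitSet();
        r.set(0, numDocs);
        r.andNot(a);
        return r;
    }
}
```

The not method has to start from the set of all document numbers. A term that appears in every document provides exactly that set as an ordinary posting list, which is why the workaround mentioned above makes full negation expressible.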

If TF/IDF weighting is a problem for you, a custom Similarity implementation
allows you to remove all references to length normalization and document
frequencies.

Regards,

Kind regards from Saarbrücken

--

Dr.-Ing. Karsten Konrad
Head of Artificial Intelligence Lab

XtraMind Technologies GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
Phone: +49 (681) 3025113
Fax: +49 (681) 3025109
[EMAIL PROTECTED]
www.xtramind.com



-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Monday, 1 December 2003 13:11
To: [EMAIL PROTECTED]
Subject: Real Boolean Model in Lucene?


Hi,

Is it possible to use a real boolean model in Lucene for searching? When one uses
the QueryParser with a boolean query (e.g. "dog AND horse"), one gets a list of
documents from the Hits object. However, these documents have a ranking (score).

My question: does Lucene use TF/IDF for getting this? (That would mean it does not
use the boolean model for the boolean query...)

How can one run a boolean-model search in which every hit has score=1? Is there an
example?

Cheers,
Ralph




Real Boolean Model in Lucene?

2003-12-01 Thread ambiesense
Hi,

Is it possible to use a real boolean model in Lucene for searching? When one
uses the QueryParser with a boolean query (e.g. "dog AND horse"), one
gets a list of documents from the Hits object. However, these documents have
a ranking (score).

My question: does Lucene use TF/IDF for getting this? (That would mean it
does not use the boolean model for the boolean query...)

How can one run a boolean-model search in which every hit has score=1?
Is there an example?

Cheers,
Ralph




Re: WebLucene 0.3 release:support CJK, use sax based indexing, docID based result sorting and xml format output with highlighting support.

2003-12-01 Thread Che Dong
build..properties.default 

# -
# WebLucene  BUILD  PROPERTIES
# -
jsdk_jar=/usr/local/resin/lib/jsdk23.jar

# Home directory of JavaCC
javacc.home = /usr/java/javacc/bin

# modify following on Windows
# jsdk_jar=c:\\resin\\lib\\jsdk23.jar
# javacc.home = c:\\java\\javacc\\bin


javacc.zip.dir = ${javacc.home}/lib
javacc.zip = ${javacc.zip.dir}/JavaCC.zip
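
On the doubled backslashes in the commented-out Windows lines: property files treat a single backslash as an escape character, so Windows paths need either "\\" or forward slashes. The sketch below demonstrates this with java.util.Properties, on the assumption that the build tool reads the file with the same escaping rules. The class name is made up, and note that Properties itself does not expand references such as ${javacc.home}; that expansion is left to the build tool.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Illustrative helper: parses property-file text so the effect of backslash escaping can be inspected. */
final class WindowsPathProps {
    static Properties parse(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            // Reading from an in-memory string should not fail; rethrow unchecked if it somehow does.
            throw new UncheckedIOException(e);
        }
        return p;
    }
}
```

Parsing the line jsdk_jar=c:\\resin\\lib\\jsdk23.jar yields a value with single backslashes, which is what Windows expects. Writing the path with forward slashes (c:/resin/lib/jsdk23.jar) avoids the escaping question altogether and is generally accepted by Java on Windows.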

Che, Dong
- Original Message - 
From: "Tun Lin" <[EMAIL PROTECTED]>
To: "'Lucene Developers List'" <[EMAIL PROTECTED]>; "'Lucene Users List'" <[EMAIL PROTECTED]>
Sent: Monday, December 01, 2003 11:34 AM
Subject: RE: WebLucene 0.3 release:support CJK, use sax based indexing, docID based 
result sorting and xml format output with highlighting support.


> Hi,
> 
> Do you have the install.txt for windows XP setup of the WebLucene? It seems that
> the install.txt is only for UNIX setup.
> 
> Thanks.  
> 
> -Original Message-
> From: Che Dong [mailto:[EMAIL PROTECTED] 
> Sent: Sunday, November 30, 2003 9:57 PM
> To: Lucene Developers List; Lucene Users List
> Subject: WebLucene 0.3 release:support CJK, use sax based indexing, docID based
> result sorting and xml format output with highlighting support.
> 
> http://sourceforge.net/projects/weblucene/
> 
> WebLucene: 
> A Lucene search engine XML interface, providing SAX-based indexing, result
> sorting by indexing sequence, and XML output with highlighting support. The
> CJKTokenizer supports Chinese, Japanese and Korean alongside Western
> languages.
> 
> The key features:
> 1 The bi-gram based CJK support: org/apache/lucene/analysis/cjk/CJKTokenizer
> 
> 2 docID based result sorting: org/apache/lucene/search/IndexOrderSearcher
> 
> 3 xml output: com/chedong/weblucene/search/DOMSearcher
> 
> 4 sax based indexing: com/chedong/weblucene/index/SAXIndexer
> 
> 5 token based highlighter: 
> reverse StopTokenzier:
> org/apache/lucene/anlysis/HighlightAnalyzer.java
>   HighlightFilter.java
> with abstract:
> com/chedong/weblucene/search/WebluceneHighlighter
> 
> 6 A simplified query parser:
> google like syntax with term limit
> org/apache/lucene/queryParser/SimpleQueryParser
> modified from early version of Lucene :)
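
For readers unfamiliar with the bi-gram approach named in feature 1, here is a minimal sketch of overlapping bigram tokenization for a run of CJK characters. It is not the WebLucene CJKTokenizer code, whose details are not shown in this thread; the class name is invented, and the sketch assumes the input is a single run of CJK characters from the Basic Multilingual Plane (one Java char per character).

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative bigram splitter for a run of CJK characters. */
final class BigramSketch {
    /** "ABCD" becomes [AB, BC, CD]; a single character is returned as its own token. */
    static List<String> bigrams(String run) {
        List<String> out = new ArrayList<>();
        if (run.length() == 1) {
            out.add(run);
            return out;
        }
        // Slide a two-character window over the run, advancing one character at a time.
        for (int i = 0; i + 1 < run.length(); i++) {
            out.add(run.substring(i, i + 2));
        }
        return out;
    }
}
```

Because written Chinese and Japanese do not separate words with spaces, overlapping bigrams give the index something word-like to match without needing a dictionary, at the cost of a larger index and some false matches across word boundaries.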
> 
> Regards
> 
> Che, Dong
> 
> 
> 
> 
>