NPE when using explain

2003-12-03 Thread Dror Matalon

Hi,

I'm trying to use IndexSearcher.explain(Query query, int doc) and am
getting an NPE. If I remove the "explain" call, the search works fine.
I poked a little at the TermQuery.java code, but I can't really tell
what's causing the exception.

This is with 1.3rc3.


Exception in thread "main" java.lang.NullPointerException
        at org.apache.lucene.search.TermQuery$TermWeight.explain(TermQuery.java:142)
        at org.apache.lucene.search.BooleanQuery$BooleanWeight.explain(BooleanQuery.java:186)
        at org.apache.lucene.search.IndexSearcher.explain(IndexSearcher.java:196)
        at LuceneCli.search(LuceneCli.java:78)
        at LuceneLine.handleCommand(LuceneLine.java:188)
        at LuceneLine.(LuceneLine.java:117)
        at LuceneLine.main(LuceneLine.java:136)

Here is the area of the code that caused this:

Hits hits = initSearch(queryString);
System.out.println(hits.length() + " total matching documents");

final int HITS_PER_PAGE = 10;
message("--");
for (int start = 0; start < hits.length(); start += HITS_PER_PAGE) {
  int end = Math.min(hits.length(), start + HITS_PER_PAGE);
  for (int ii = start; ii < end; ii++) {
    Document doc = hits.doc(ii);
    message(" " + ii + " score:" + hits.score(ii) + "-");
    if (explain) {
      Explanation exp = searcher.explain(query, ii);
      message("Explanation:" + exp.toString());
    }
    printHit(doc);
  }
}


Regards,

Dror

-- 
Dror Matalon
Zapatec Inc 
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: SearchBlox J2EE Search Component Version 1.1 released

2003-12-03 Thread Tatu Saloranta
On Tuesday 02 December 2003 09:51, Tun Lin wrote:
> Does anyone know a search engine that supports xml formats?

There's no way to generically "support xml formats", as XML is just a 
meta-language. However, when building specific search engines on the Lucene 
core, it should be reasonably straightforward to implement more accurate 
xml-structure-aware tokenization for specific XML applications like DocBook 
or other domain-specific formats.
So, if any search engine advertises "indexing xml content", one had better 
read the fine print to learn what it really claims.

It might be interesting to create a Lucene plug-in that, given a specification 
of how sub-trees under specific elements should be handled, would tokenize and 
index their content into separate fields. The implementation shouldn't be very 
difficult -- just use a standard XML parser (SAX, DOM), match xpaths, feed the 
text to an analyzer, and then add it to the index. This could also be used for 
HTML (pre-filtering with JTidy or similar first to get XML-compliant HTML).
I wouldn't be surprised if someone on the list has already done this?
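For illustration, here is a rough sketch of the element-routing part of such a plug-in, using only the JDK's built-in SAX parser. The class name and the element-path keying scheme are my own invention, and the actual Lucene step is left out: a real plug-in would feed each collected buffer through an analyzer and add it to the index as a separate field, rather than just returning a map.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: route character data under each element path into
// its own buffer. Each buffer would become one Lucene field in a real
// plug-in; here we only collect the text.
public class XmlFieldRouter extends DefaultHandler {

    private final StringBuilder path = new StringBuilder();
    private final Map<String, StringBuilder> fields = new LinkedHashMap<>();

    public void startElement(String uri, String local, String qName, Attributes atts) {
        path.append('/').append(qName);           // descend: extend the current path
    }

    public void endElement(String uri, String local, String qName) {
        path.setLength(path.length() - qName.length() - 1);  // ascend: pop the path
    }

    public void characters(char[] ch, int start, int len) {
        String key = path.substring(1);           // drop the leading '/'
        fields.computeIfAbsent(key, k -> new StringBuilder()).append(ch, start, len);
    }

    /** Parse an XML string and return element-path -> concatenated text. */
    public static Map<String, String> extract(String xml) throws Exception {
        XmlFieldRouter h = new XmlFieldRouter();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), h);
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, StringBuilder> e : h.fields.entrySet())
            out.put(e.getKey(), e.getValue().toString().trim());
        return out;
    }

    public static void main(String[] args) throws Exception {
        String doc = "<article><title>Lucene and XML</title>"
                   + "<body>Indexing element text separately.</body></article>";
        System.out.println(extract(doc));
    }
}
```

Matching full xpath expressions, as suggested above, would need more bookkeeping than this path string, but the routing principle is the same.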

-+ Tatu +-






summary

2003-12-03 Thread uma mahesh rao
Hi,

In the Lucene demo, the summary that is displayed contains text from inside HTML
tags (like margin, top, left, and so on).

So how can I display a summary that actually relates to the page description?

Your help is appreciated.

Thanking you,

mahesh

AW: Document Similarity

2003-12-03 Thread Karsten Konrad

Hi,

>> Do they produce same ranking results? 

No; Lucene's operations on query weight and length normalization are not
equivalent to a vanilla cosine in vector space.

>> I guess the 2nd approach will be more precise but slow.

Query similarity will indeed be faster, and may actually not be worse. A
straightforward cosine without IDF weighting of terms (as Lucene does) will
almost certainly be less precise if you have documents of different lengths -
word occurrence probabilities in texts of different lengths vary greatly,
and the cosine of independent longer texts will often be greater than
that of texts that actually share the same topic but are short, just because
of randomly shared non-content words.

If, on the other hand, you choose the right TF/IDF weighting of
terms, the cosine in this warped vector space could be (a)
equivalent to the one Lucene computes - this requires some work - or
(b) might even be better on average.

However, the last time I counted, there were about 250 different
TF/IDF formulas around in IR publications, machine learning,
computational linguistics, and so on. Performance depends on domain
and language.

But if I were you, I would just start playing and having fun with
the stuff...
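To make the effect of IDF weighting concrete, here is a toy, self-contained illustration (deliberately not Lucene's scoring formula, and the class and method names are my own): cosine similarity over raw term counts versus over TF*IDF-weighted counts. A word like "the" that occurs in every document gets idf = log(1) = 0 and drops out of the weighted cosine, so random shared non-content words stop inflating the score.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration of plain cosine vs. TF*IDF-weighted cosine.
public class TfIdfCosine {

    /** Raw term counts of a whitespace-tokenized text. */
    public static Map<String, Double> counts(String text) {
        Map<String, Double> v = new HashMap<>();
        for (String t : text.toLowerCase().split("\\s+"))
            v.merge(t, 1.0, Double::sum);
        return v;
    }

    /** Standard cosine: dot product over the product of vector norms. */
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            na += e.getValue() * e.getValue();
        }
        for (double w : b.values()) nb += w * w;
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    /** Reweight a count vector by idf(t) = log(N / df(t)). */
    public static Map<String, Double> tfidf(Map<String, Double> v,
                                            List<Map<String, Double>> corpus) {
        Map<String, Double> w = new HashMap<>();
        for (Map.Entry<String, Double> e : v.entrySet()) {
            int df = 0;
            for (Map<String, Double> d : corpus)
                if (d.containsKey(e.getKey())) df++;
            w.put(e.getKey(), e.getValue() * Math.log((double) corpus.size() / df));
        }
        return w;
    }

    public static void main(String[] args) {
        List<Map<String, Double>> corpus = Arrays.asList(
            counts("the quick brown fox"),
            counts("the lazy dog"),
            counts("the quick dog"));
        // "the" appears in every document, so its idf weight is zero and
        // the weighted cosine is driven by the content words only.
        System.out.println("raw cosine:   " + cosine(corpus.get(0), corpus.get(2)));
        System.out.println("tfidf cosine: " + cosine(tfidf(corpus.get(0), corpus),
                                                     tfidf(corpus.get(2), corpus)));
    }
}
```

With these three documents, the raw cosine between documents 0 and 2 is noticeably higher than the TF*IDF-weighted one, because the shared "the" carries most of the raw overlap.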

Karsten


-Original Message-
From: Jing Su [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, December 2, 2003 18:12
To: [EMAIL PROTECTED]
Subject: Document Similarity



Hi,

I have read some posts in the user/developer archives about Lucene-based
document similarity comparison. In summary, two approaches are mentioned:

1 - Convert a document into a query;
2 - Treat each document as a vector, then rank according to their distance
(cosine).

Do they produce the same ranking results? Is there any other way to do this?
I guess the 2nd approach will be more precise but slow.

Thanks.

Jing




RE: Probabilistic Model in Lucene - possible?

2003-12-03 Thread Chong, Herb
I think I am missing the original question, but by most accepted definitions, the
tf/idf model in Lucene is a probabilistic model. It has strange normalizations,
though, that don't allow comparisons of rank values across queries.

It isn't terribly hard to make a normalized probabilistic model that allows comparing
document scores across queries and assigns a meaning to the score. I've done it.
However, that means abandoning idf and keeping actual term frequencies for each
document along with document size. Once you normalize this way, you can intermingle
document scores from different queries and different corpora and make statements about
the absolute value of the score. It also leads directly into the discussion we had
earlier about interterm correlations and how to handle them properly, since the full
interterm probabilistic model has the traditional tf/idf model as a special case.
Interjecting Boolean conditions and boosts makes the model much more complicated.
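As a loose illustration of such normalized scoring - this is a textbook query-likelihood sketch of my own, not the exact model described above - a score built from per-document term probabilities and normalized by query length stays in (0, 1], so values can be meaningfully compared across queries and corpora:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: query-likelihood scoring with add-one smoothing.
// Because the result is a length-normalized probability, it lives in
// (0, 1] and is comparable across different queries, unlike raw ranks.
public class QueryLikelihood {

    /** Term frequencies of a whitespace-tokenized document. */
    public static Map<String, Integer> tf(String text) {
        Map<String, Integer> v = new HashMap<>();
        for (String t : text.toLowerCase().split("\\s+"))
            v.merge(t, 1, Integer::sum);
        return v;
    }

    /** Geometric mean of smoothed p(term | doc). */
    public static double score(String[] query, Map<String, Integer> doc, int vocabSize) {
        int docLen = 0;
        for (int c : doc.values()) docLen += c;
        double logSum = 0;
        for (String t : query) {
            // add-one smoothing keeps an unseen term from zeroing the score
            double p = (doc.getOrDefault(t, 0) + 1.0) / (docLen + vocabSize);
            logSum += Math.log(p);
        }
        return Math.exp(logSum / query.length); // normalize by query length
    }

    public static void main(String[] args) {
        Map<String, Integer> d = tf("lucene is a search library for text search");
        System.out.println(score(new String[] {"search", "library"}, d, 1000));
    }
}
```

Note this uses only actual term frequencies and document length, with no idf, matching the trade-off described above.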

Herb

-Original Message-
From: Karsten Konrad [mailto:[EMAIL PROTECTED]
Sent: Wednesday, December 03, 2003 4:51 PM
To: Lucene Users List
Subject: AW: Probabilistic Model in Lucene - possible?

>>
I would highly appreciate it if the experts here (especially Karsten or
Chong) look at my idea and tell me if this would be possible.
>>

Sorry, I have no idea about how to use a probabilistic approach with 
Lucene, but if anyone does so, I would like to know, too. 

I am currently puzzled by a related question: I would like to know
if there are any approaches to get a confidence value for relevance 
rather than a ranking. I.e., it would be nice to have a ranking 
weight whose value has some kind of semantics, such that we could 
compare results from different queries. Can probabilistic approaches 
do anything like this? 




AW: Probabilistic Model in Lucene - possible?

2003-12-03 Thread Karsten Konrad

Hi,

>>
I would highly appreciate it if the experts here (especially Karsten or
Chong) look at my idea and tell me if this would be possible.
>>

Sorry, I have no idea about how to use a probabilistic approach with 
Lucene, but if anyone does so, I would like to know, too. 

I am currently puzzled by a related question: I would like to know
if there are any approaches to get a confidence value for relevance 
rather than a ranking. I.e., it would be nice to have a ranking 
weight whose value has some kind of semantics, such that we could 
compare results from different queries. Can probabilistic approaches 
do anything like this? 

Any help appreciated,

Karsten



-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, December 3, 2003 15:13
To: [EMAIL PROTECTED]
Subject: Probabilistic Model in Lucene - possible?


Hello group,

from the very inspiring conversations with Karsten I know that Lucene is based on a
Vector Space Model. I am just wondering if it would be possible to turn this into a
probabilistic model approach. Of course I do know that I cannot change the underlying
indexing and searching principles. However, it would be possible to change the index
term weight to either 1.0 (relevant) or 0.0 (non-relevant). For the similarity I
would need to implement another similarity algorithm.

I would highly appreciate it if the experts here (especially Karsten or
Chong) could look at my idea and tell me if this would be possible. If yes, how much
effort would need to go into that? I am sure there are many other issues which I have
not considered...

Kind Regards,
Ralf


-- 
+++ GMX - the first address for Mail, Message, More +++
New: price cuts for MMS and FreeMMS! http://www.gmx.net






RE: What about Spindle

2003-12-03 Thread Otis Gospodnetic
There is LARM, there is Nutch, there is Egothor (doesn't use Lucene),
etc.

Otis

--- "Zhou, Oliver" <[EMAIL PROTECTED]> wrote:
> I think it is a common task to index a jsp-based web site.  A lot of
> people ask how to do so on this mailing list.  However, Lucene does not
> have a ready-to-use web crawler.  My question is: has anybody used
> Spindle to index a jsp-based web site, or is there any other tool out
> there?
> 
> Thanks,
> Oliver
> 
> 
> 
> -Original Message-
> From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, December 03, 2003 11:25 AM
> To: Lucene Users List
> Subject: Re: What about Spindle
> 
> 
> You should ask Spindle author(s).  The error doesn't look like
> something that is related to Lucene, really.
> 
> Otis
> 
> --- "Zhou, Oliver" <[EMAIL PROTECTED]> wrote:
> > What about Spindle? Has anybody used it to crawl a jsp-based web
> > site? Do I need to install listlib.jar to do so?
> > 
> > I got the error message "Jsp Translate: Unable to find setter method for
> > attribute: class" when I tried to run listlib-example.jsp in wsad.
> > 
> > Thanks,
> > Oliver
> > 
> > 
> >  
> > 
> > 


__
Do you Yahoo!?
Free Pop-Up Blocker - Get it now
http://companion.yahoo.com/




Re: Ways to search indexes

2003-12-03 Thread Dror Matalon
On Wed, Dec 03, 2003 at 02:49:12PM +, jt oob wrote:
> --- Dror Matalon <[EMAIL PROTECTED]> wrote:
> > On Tue, Dec 02, 2003 at 01:54:58PM +, jt oob wrote:
> > > Hi,
> > > 
> > > I have just indexed a lot of news (nntp) postings.
> > > I now have an index for each topic (a topic can have many
> > newsgroups)
> > > 
> > > The index sizes are:
> > > 
> > > 2.6G Current Affairs
> > > 2.4G Celebs
> > > 119M Recreation
> > > 3.0M Tech - Mac
> > > 2.4G Tech - Windows
> > > 936M Tech - Linux
> > > 702M Tech - Other
> > >  96M Tech - Consoles
> > 
> > Around 15 gigs. How many days of news?
> 
> Not sure how many days, but it's around 5 million postings.

So each posting is roughly 3K. More than I would have thought, but not
too surprising.
The main reason I asked about the number of days is to get a sense of the
growth. 15 gigs is a big index, but to understand the performance
repercussions, the rate of growth is equally important. I suspect that by
the time you hit 100 gigs, you'll have one of the biggest indexes around,
and you'll have to throw quite heavy hardware at it or distribute the load
to get reasonable performance.


-- 
Dror Matalon
Zapatec Inc 
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com




Re: What about Spindle

2003-12-03 Thread Leo Galambos
You can try Capek (needs JDK1.4, because it uses NIO). It can crawl 
whatever you like.

API:
http://www.egothor.org/api/robot/
Console - demo (*.dundee.ac.uk):
http://www.egothor.org/egothor/index.jsp?q=http%3A%2F%2Fwww.compbio.dundee.ac.uk%2F
Leo

Zhou, Oliver wrote:

I think it is a common task to index a jsp-based web site.  A lot of people
ask how to do so on this mailing list.  However, Lucene does not have a
ready-to-use web crawler.  My question is: has anybody used Spindle to
index a jsp-based web site, or is there any other tool out there?
Thanks,
Oliver


-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Wednesday, December 03, 2003 11:25 AM
To: Lucene Users List
Subject: Re: What about Spindle
You should ask Spindle author(s).  The error doesn't look like
something that is related to Lucene, really.
Otis

--- "Zhou, Oliver" <[EMAIL PROTECTED]> wrote:
 

What about Spindle? Has anybody used it to crawl a jsp-based web
site? Do I need to install listlib.jar to do so?

I got the error message "Jsp Translate: Unable to find setter method for
attribute: class" when I tried to run listlib-example.jsp in wsad.
Thanks,
Oliver






RE: What about Spindle

2003-12-03 Thread Zhou, Oliver
I think it is a common task to index a jsp-based web site.  A lot of people
ask how to do so on this mailing list.  However, Lucene does not have a
ready-to-use web crawler.  My question is: has anybody used Spindle to
index a jsp-based web site, or is there any other tool out there?

Thanks,
Oliver



-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Wednesday, December 03, 2003 11:25 AM
To: Lucene Users List
Subject: Re: What about Spindle


You should ask Spindle author(s).  The error doesn't look like
something that is related to Lucene, really.

Otis

--- "Zhou, Oliver" <[EMAIL PROTECTED]> wrote:
> What about Spindle? Has anybody used it to crawl a jsp-based web
> site? Do I need to install listlib.jar to do so?
> 
> I got the error message "Jsp Translate: Unable to find setter method for
> attribute: class" when I tried to run listlib-example.jsp in wsad.
> 
> Thanks,
> Oliver



Re: What about Spindle

2003-12-03 Thread Otis Gospodnetic
You should ask Spindle author(s).  The error doesn't look like
something that is related to Lucene, really.

Otis

--- "Zhou, Oliver" <[EMAIL PROTECTED]> wrote:
> What about Spindle? Has anybody used it to crawl a jsp-based web
> site? Do I need to install listlib.jar to do so?
> 
> I got the error message "Jsp Translate: Unable to find setter method for
> attribute: class" when I tried to run listlib-example.jsp in wsad.
> 
> Thanks,
> Oliver



What about Spindle

2003-12-03 Thread Zhou, Oliver
What about Spindle? Has anybody used it to crawl a jsp-based web site? Do I
need to install listlib.jar to do so?

I got the error message "Jsp Translate: Unable to find setter method for
attribute: class" when I tried to run listlib-example.jsp in wsad.

Thanks,
Oliver


 





Re: Hits - how many documents?

2003-12-03 Thread ambiesense
That was actually the answer. Originally I thought Hits provides a reference
to all documents. However, it seems logical that documents with a score of
0.0 should not be included.

Thank you,
Ralf

> I'm a bit confused by what you're asking.  Hits points to all documents 
> that matched the query.  A score > 0.0 is needed.
> 
>   Erik
> 
> 







Re: Hits - how many documents?

2003-12-03 Thread Erik Hatcher
On Wednesday, December 3, 2003, at 10:16 AM, Ralph wrote:

> Does this mean Hits points to ALL documents and the last one might have a
> score of 0.0? If it does not contain all documents, where is the threshold
> then? Or based on which condition does it stop pointing to certain documents?

I'm a bit confused by what you're asking.  Hits points to all documents 
that matched the query.  A score > 0.0 is needed.

	Erik



Re: Hits - how many documents?

2003-12-03 Thread Ralph
Does this mean Hits points to ALL documents and the last one might have a
score of 0.0? If it does not contain all documents, where is the threshold
then? Or based on which condition does it stop pointing to certain documents?

Ralf

> On Wednesday, December 3, 2003, at 09:36  AM, Ralph wrote:
> > is there a maximum of documents Hits provides, or is it unlimited
> > (i.e. limited to the heap size of the VM)? If there is a maximum,
> > what is the number?
> 
> Hits represents all documents that matched the query (and optionally 
> filtered).
> 
> But, Hits does not *contain* the documents - it points to them so that 
> its memory footprint is quite small.  (there is some slight caching of 
> up to 200 documents)
> 
>   Erik
> 
> 







Re: Ways to search indexes

2003-12-03 Thread jt oob
--- Dror Matalon <[EMAIL PROTECTED]> wrote:
> On Tue, Dec 02, 2003 at 01:54:58PM +, jt oob wrote:
> > Hi,
> > 
> > I have just indexed a lot of news (nntp) postings.
> > I now have an index for each topic (a topic can have many
> newsgroups)
> > 
> > The index sizes are:
> > 
> > 2.6G Current Affairs
> > 2.4G Celebs
> > 119M Recreation
> > 3.0M Tech - Mac
> > 2.4G Tech - Windows
> > 936M Tech - Linux
> > 702M Tech - Other
> >  96M Tech - Consoles
> 
> Around 15 gigs. How many days of news?

Not sure how many days, but it's around 5 million postings.


Download Yahoo! Messenger now for a chance to win Live At Knebworth DVDs
http://www.yahoo.co.uk/robbiewilliams




Re: Hits - how many documents?

2003-12-03 Thread Erik Hatcher
On Wednesday, December 3, 2003, at 09:36 AM, Ralph wrote:

> is there a maximum of documents Hits provides, or is it unlimited (i.e.
> limited to the heap size of the VM)? If there is a maximum, what is the
> number?

Hits represents all documents that matched the query (and were optionally 
filtered).

But, Hits does not *contain* the documents - it points to them, so its 
memory footprint is quite small.  (There is some slight caching of up to 
200 documents.)

	Erik



Hits - how many documents?

2003-12-03 Thread Ralph
Hi,

is there a maximum number of documents Hits provides, or is it unlimited (i.e.
limited to the heap size of the VM)? If there is a maximum, what is the number?

Ralf







Probabilistic Model in Lucene - possible?

2003-12-03 Thread ambiesense
Hello group,

from the very inspiring conversations with Karsten I know that Lucene is
based on a Vector Space Model. I am just wondering if it would be possible to
turn this into a probabilistic model approach. Of course I do know that I
cannot change the underlying indexing and searching principles. However, it
would be possible to change the index term weight to either 1.0 (relevant) or
0.0 (non-relevant). For the similarity I would need to implement another
similarity algorithm.
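Outside of Lucene, the binary-weight idea can be sketched in a few lines (class and method names are my own, and this deliberately sidesteps Lucene's Similarity machinery): with term weights forced to 1.0 or 0.0, each document reduces to a set of terms, and a set-based measure such as the Jaccard coefficient can stand in for the weighted cosine.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch of binary (1.0/0.0) term weighting: documents become term sets
// and similarity becomes a set-overlap measure.
public class BinarySimilarity {

    /** Binary "vector": the set of terms present in the text. */
    public static Set<String> terms(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase().split("\\s+")));
    }

    /** |A n B| / |A u B| -- 1.0 for identical term sets, 0.0 for disjoint. */
    public static double jaccard(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        System.out.println(jaccard(terms("probabilistic model in lucene"),
                                   terms("vector model in lucene")));
    }
}
```

Plugging something equivalent into Lucene would, as the message says, mean overriding its similarity computation; this sketch only shows what the binary weighting itself buys.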

I would highly appreciate it if the experts here (especially Karsten or
Chong) could look at my idea and tell me if this would be possible. If yes,
how much effort would need to go into that? I am sure there are many other
issues which I have not considered...

Kind Regards,
Ralf








RE: Query reformulation (Relevance Feedback) in Lucene?

2003-12-03 Thread Chong, Herb
There is no direct support in Lucene for this. There are several strategies for
automatic query expansion, and most of them rely on either extensive domain-specific
analysis of the top N documents, on the assumption that the search engine performs
well enough to guarantee that the top N documents are all relevant, or on a special
domain-specific corpus of "good documents", where the initial search runs against
these picked documents and their terms are mined to augment the initial query before
resubmitting it to the original corpus. All of these are things you have to do
yourself.  Term reweighting happens by using term boosts. How much to boost by is an
open question.
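The expand-then-boost workflow can be sketched as follows. This is a hypothetical helper of my own, not a Lucene API: it mines the most frequent new terms from the user-marked relevant documents and attaches Lucene's query-syntax "^boost" to each; the boost value itself is, as noted above, an open question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of relevance-feedback query expansion: take the
// top-n most frequent terms from relevant documents that are not already
// in the query, and append them with a "^boost" weight.
public class FeedbackExpansion {

    public static String expand(String query, List<String> relevantDocs,
                                int n, double boost) {
        Set<String> existing =
            new HashSet<>(Arrays.asList(query.toLowerCase().split("\\s+")));
        Map<String, Integer> freq = new HashMap<>();
        for (String doc : relevantDocs)
            for (String t : doc.toLowerCase().split("\\s+"))
                if (!existing.contains(t)) freq.merge(t, 1, Integer::sum);
        // pick the n most frequent new terms
        List<Map.Entry<String, Integer>> top = new ArrayList<>(freq.entrySet());
        top.sort((a, b) -> b.getValue() - a.getValue());
        StringBuilder q = new StringBuilder(query);
        for (int i = 0; i < Math.min(n, top.size()); i++)
            q.append(' ').append(top.get(i).getKey()).append('^').append(boost);
        return q.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("lucene search",
            Arrays.asList("lucene index ranking", "ranking and scoring"), 1, 2.0));
    }
}
```

The resulting string can then be handed back to Lucene's QueryParser, which understands the "term^boost" syntax.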

Herb...

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Wednesday, December 03, 2003 6:55 AM
To: [EMAIL PROTECTED]
Subject: Query reformulation (Relevance Feedback) in Lucene?


Hello Group of Lucene users,

query reformulation is understood as an effective way to improve retrieval
power significantly. The theory teaches us that it consists of two basic steps:

a) Query expansion (with new terms)
b) Reweighting of the terms in the expanded query

User relevance feedback is the most popular strategy for performing query
reformulation because it is user-centered.

Does Lucene generally support this approach? In particular, I am wondering if:

1) there are classes which directly support query expansion, OR
2) I would need to do some programming on top of more generic parts?

I do not know about 1). All I know about 2) is what I think could work, with
no evidence that it actually does :-) I think query expansion with new terms is
easy: I would just need to create a new QueryParser object with the existing
terms plus the top n (most frequent) terms of the documents the user considers
relevant. Then I would have an expanded query (a). However, I do not know how I
can reweight these terms. When I formulate the Query, I do not actually know
about their weights, since that is done internally. Does anybody have any idea?
Did anybody try to solve this, and does anyone have some examples which he/she
would like to provide?

Cheers,
Ralf




Query reformulation (Relevance Feedback) in Lucene?

2003-12-03 Thread ambiesense
Hello Group of Lucene users,

query reformulation is understood as an effective way to improve retrieval
power significantly. The theory teaches us that it consists of two basic steps:

a) Query expansion (with new terms)
b) Reweighting of the terms in the expanded query

User relevance feedback is the most popular strategy for performing query
reformulation because it is user-centered.

Does Lucene generally support this approach? In particular, I am wondering if:

1) there are classes which directly support query expansion, OR
2) I would need to do some programming on top of more generic parts?

I do not know about 1). All I know about 2) is what I think could work, with
no evidence that it actually does :-) I think query expansion with new terms is
easy: I would just need to create a new QueryParser object with the existing
terms plus the top n (most frequent) terms of the documents the user considers
relevant. Then I would have an expanded query (a). However, I do not know how I
can reweight these terms. When I formulate the Query, I do not actually know
about their weights, since that is done internally. Does anybody have any idea?
Did anybody try to solve this, and does anyone have some examples which he/she
would like to provide?

Cheers,
Ralf







Re: Translation.

2003-12-03 Thread Otis Gospodnetic
Uh, I get to do this dirty job. :(

Lucene-user and lucene-dev are not the appropriate fora for questions
such as this one.
Please ask the original author of the text for help, or use an online
translation service, such as the one at http://babelfish.av.com

Also, for questions about Lucene usage, problems, help, etc., please
email _only_ the lucene-user mailing list.  The lucene-dev mailing list is
used by developers of Lucene, not by developers who are using Lucene.

Thanks,
Otis


--- Tun Lin <[EMAIL PROTECTED]> wrote:
> Hi,
> 
> Can anyone translate this text for me? I cannot understand the
> instructions.
> Please help!
> 
> Thanks.
> 
> ===
> | LUCY 1.1   |   readme.txt    Last updated: 18/03/2003
> ===
> 
> 
> STRUCTURE
> 
> Lucy 1.1  -> Lucene 1.2
>   -> HTMLParser 1.2
>   -> PdfBox 0.5.6
>   -> wvWare 0.7.2-3
>   -> xlhtml 0.4.9
>   -> antiword 0.33
>   -> Xpdf 2.01
>   -> Snowball 0.1
>   -> NGramJ 01.12.11
>   -> it.corila.lucy   -> IndexAll.java
>   -> SearchIndex.java
>   -> HTMLDocument.java
>   -> PDFDocument.java
>   -> ExternalParser.java
>   -> ItalianStemFilter.java
>   -> EnglishStemFilter.java
>   -> ApostropheFilter.java
>   -> IndexAnalyzer.java
>   -> SearchAnalyzer.java
>   -> LanguageCategorizer
>   -> NgramjCategorizer.java
> 
> 
> DESCRIPTION
> 
> Lucy can index all files with the extensions txt, html, pdf, doc, ppt,
> and xls contained in a base folder and its subfolders. It allows searches
> from the DOS command line or through a web interface. It handles texts in
> Italian and English with language-specific lexical processing.
> 
> 
> SUPPORTED OPERATING SYSTEMS
> 
> Windows 98 / Windows 2000 / Windows XP
> 
> 
> SYSTEM REQUIREMENTS
> 
> None, except the permissions needed to write files to a system folder.
> To use the search module with the web interface, Apache Tomcat version 3
> or 4 is required.
> 
> 
> INSTALLATION
> 
> Run the automatic installation procedure Lucy1.1.exe, or unpack the file
> Lucy1.1.zip into a folder (NB: the path must not contain spaces).
> By default the application uses its own Java virtual machine. It is
> possible to use another one already installed on the system by changing
> the value of the MYJAVAPATH variable in the file jvm.bat.
> In that case the jre folder can be deleted to reduce disk usage by about
> 40 MBytes.
> 
> 
> CONFIGURATION
> 
> Edit the values of the variables in the file properties.txt, in the
> application's base folder:
> 
> lucy.path: folder where the application is installed
> 
> log.files.dir: folder where the log files will be created
> 
> del.temp.files: delete temporary files at the end of indexing (yes/no)
> 
> doc.parser: parser to use for .doc files (antiword/wvware)
> 
> pdf.parser: parser to use for .pdf files (xpdf/pdfbox)
> 
> index.dir: folder where the indexes will be stored
> 
> index.name: name of the index to be created
> 
> indexing.folder: folder to be indexed
> 
> IMPORTANT: all paths must use two backslashes (\\) as directory
> separators instead of a single backslash
> 
> 
> USAGE
> 
> The three batch files in the application's base folder can be launched
> directly from Windows with a double click.
> 
> indicizza.bat    creates an index
> 
> aggiorna.bat     updates an index
> 
> cerca.bat        searches an index
> 
> All required parameters (name and location of the index, path of the
> folder to be indexed) must be specified beforehand in the file
> properties.txt
> 
> Alternatively, the procedures can be run from the DOS command line, again
> after first editing the file properties.txt.
> In this case, using the syntax:
> 
> cerca index-path
> 
> searches can also be run on other previously created indexes without
> modifying the file properties.txt
> 
> 
> NOTES ON USING THE PARSERS
> 
> The default values set for the parsers are those recommended for the
> first run