RE: Keyword extraction

2008-11-27 Thread Plaatje, Patrick

Hi Aleksander,

Thanks to your help and the other comments, we're now at a point where a 
MoreLikeThis list is returned, showing 10 related records. However, the 
query returns no keywords whatsoever. Is the query string still wrong, or is 
something else required?

The querystring we're currently executing is:

http://suempnr3:8080/solr/select/?q=amsterdam&mlt.fl=text&mlt.displayTerms=list&mlt=true


Best,

Patrick 

-----Original Message-----
From: Aleksander M. Stensby [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, 26 November 2008 15:07
To: solr-user@lucene.apache.org
Subject: Re: Keyword extraction

Ah, yes, that is important. In Lucene, MLT checks whether the term vector is 
stored; if it is not, it can still perform the query, but in a much, much 
less efficient way: Lucene will re-analyze the document (and the variable 
DEFAULT_MAX_NUM_TOKENS_PARSED will be used to limit the number of tokens that 
will be parsed). (I don't want to go into details on this since I haven't 
really dug through the code :p) But when the field isn't stored either, it is 
rather difficult to re-analyze the document ;)

On a general note, if you want to really understand how the MLT works, take a 
look at the wiki or read this thorough blog post:  
http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/

Regards,
  Aleksander

On Wed, 26 Nov 2008 14:41:52 +0100, Plaatje, Patrick [EMAIL PROTECTED] wrote:

 Hi Aleksander,

 This was a typo on my end; the original query included a colon 
 instead of an equals sign. But I think it has to do with my field not 
 being stored and not being defined with termVectors=true. I'm 
 recreating the index now to see if this fixes the problem.

 Best,

 patrick

 -----Original Message-----
 From: Aleksander M. Stensby [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, 26 November 2008 14:37
 To: solr-user@lucene.apache.org
 Subject: Re: Keyword extraction

 Hi there!
 Well, first of all, I think you have an error in your query, if I'm not 
 mistaken.
 You say http://localhost:8080/solr/select/?q=id=18477975...
 but since you are referring to the field called id, you must say:
 http://localhost:8080/solr/select/?q=id:18477975...
 (use a colon instead of the equals sign).
 I think that will do the trick.
 If not, try adding debugQuery=on at the end of your request URL 
 to see debug output on how the query is parsed and if/how any 
 documents are matched against your query.
 Hope this helps.

 Cheers,
   Aleksander



 On Wed, 26 Nov 2008 13:08:30 +0100, Plaatje, Patrick 
 [EMAIL PROTECTED] wrote:

 Hi Aleksander,

 Thanks for clearing this up. I am confident that this is an avenue 
 worth exploring, as I'm just starting to grasp the matter. Do you know 
 why I'm not getting any results with the query posted earlier, then? 
 It gives me only the following:

 <lst name="moreLikeThis">
  <result name="18477975" numFound="0" start="0"/>
 </lst>

 Instead of delivering details of the interestingTerms.

 Thanks in advance

 Patrick


 -----Original Message-----
 From: Aleksander M. Stensby [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, 26 November 2008 13:03
 To: solr-user@lucene.apache.org
 Subject: Re: Keyword extraction

 I do not agree with you at all. The concept of MoreLikeThis is based 
 on the fundamental idea of TF-IDF weighting, and not term frequency 
 alone.
 Please take a look at:
 http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/similar/MoreLikeThis.html
 As you can see, it is possible to use cut-off thresholds to 
 significantly reduce the number of unimportant terms, and generate 
 highly suitable queries based on the tf-idf weight of the term. As you 
 point out, high-frequency terms alone tend to be useless for querying, 
 but taking the document frequency into account drastically increases 
 the importance of the term!

 In Solr, use parameters to manipulate your desired results:
 http://wiki.apache.org/solr/MoreLikeThis#head-6460069f297626f2a982f1e22ec5d1519c456b2c
 For instance:
 mlt.mintf - Minimum Term Frequency - the frequency below which terms 
 will be ignored in the source doc.
 mlt.mindf - Minimum Document Frequency - words that do not occur in at 
 least this many docs will be ignored.
 You can also set thresholds for term length etc.

 Hope this gives you a better idea of things.
 - Aleks

 On Wed, 26 Nov 2008 12:38:38 +0100, Scurtu Vitalie 
 [EMAIL PROTECTED]
 wrote:

 Dear Patrick, I had the same problem with the MoreLikeThis function.

 After briefly reading and analyzing the source code of the moreLikeThis 
 function in Solr, I concluded:

 MoreLikeThis uses term vectors to rank all the terms from a 
 document by frequency. According to this ranking, it will start 
 to generate queries, artificially, and search for documents.

 So, moreLikeThis will retrieve related documents by artificially 
 generating queries based on the most frequent terms.

 There's a big problem with most

Re: Keyword extraction

2008-11-27 Thread Aleksander M. Stensby


Hi again Patrick.
Glad to hear that we can contribute to help you guys. That's what this  
mailing list is for :)


First of all, I think you are using the wrong parameter to get your terms.
Take a look at  
http://lucene.apache.org/solr/api/org/apache/solr/common/params/MoreLikeThisParams.html  
to see the supported params.
In your string you use mlt.displayTerms=list, which I believe should be  
mlt.interestingTerms=list.


If that doesn't work:
One thing you should know is that, from what I can tell, you are using the  
StandardRequestHandler for your queries. The StandardRequestHandler  
supports a simplified handling of MoreLikeThis queries: it returns similar  
documents for each document in the response set. It supports the common mlt  
parameters, needs mlt=true (as you have done), and supports an mlt.count  
parameter to specify the number of similar documents returned for each  
matching doc from your query.


If you want to get the top keywords etc. (and, in essence, for your  
mlt.interestingTerms=list parameter to have any effect at all, if I'm not  
completely wrong), you will need to configure a MoreLikeThisHandler in  
your solrconfig.xml and then map that to your query.


From the sample configuration file:
	Incoming queries will be dispatched to the correct handler based on the  
path or the qt (query type) param. Names starting with a '/' are accessed  
with a path equal to the registered name.  Names without a leading '/'  
are accessed with: http://host/app/select?qt=name If no qt is defined, the  
requestHandler that declares default=true will be used.


You can read about the MoreLikeThisHandler here:  
http://wiki.apache.org/solr/MoreLikeThisHandler


Once you have it configured properly, your query would be something like:
http://localhost:8983/solr/mlt?q=amsterdam&mlt.fl=text&mlt.interestingTerms=list&mlt=true  
(I don't think you need the mlt=true here, though...)

or
http://localhost:8983/solr/select?qt=mlt&q=amsterdam&mlt.fl=text&mlt.interestingTerms=list&mlt=true
(in the last example I use qt=mlt)

Hope this helps.
Regards,
 Aleksander
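
A minimal sketch of building the two request URLs above with the parameter separators in place. It assumes a MoreLikeThisHandler is registered either under the path /mlt or under the name mlt, as in the examples in this message; the helper name build_mlt_url is hypothetical, and the base URL, query and field are just the values used in this thread.

```python
from urllib.parse import urlencode


def build_mlt_url(base, query, field, use_qt=False):
    """Build a MoreLikeThis request URL with '&'-separated parameters.

    Assumes the handler is reachable at <base>/mlt, or by name via
    <base>/select?qt=mlt when use_qt is True.
    """
    params = {
        "q": query,
        "mlt.fl": field,
        "mlt.interestingTerms": "list",
    }
    if use_qt:
        # Dispatch through /select and choose the handler with qt.
        path = "/select"
        params = {"qt": "mlt", **params}
    else:
        # Dispatch by path: the handler name starts with '/'.
        path = "/mlt"
    return f"{base}{path}?{urlencode(params)}"


print(build_mlt_url("http://localhost:8983/solr", "amsterdam", "text"))
print(build_mlt_url("http://localhost:8983/solr", "amsterdam", "text", use_qt=True))
```

Letting urlencode join the parameters avoids the hand-typed separator mistakes that are easy to make in long query strings.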



RE: Keyword extraction

2008-11-26 Thread Plaatje, Patrick

Hi All,
 
As an addition to my previous post: no interestingTerms are returned
when I execute the following URL:
 
http://localhost:8080/solr/select/?q=id=18477975&mlt.fl=text&mlt.interestingTerms=list&mlt=true&mlt.match.include=true
 
I get a moreLikeThis list though, any thoughts?
 
Best,
 
Patrick

Re: Keyword extraction

2008-11-26 Thread Scurtu Vitalie

Yes, I totally understand, and agree.  

MoreLikeThis uses TF-IDF to rank terms, then generates queries based on the 
top-ranked terms.  In any case, I wasn't able to make it work after many 
attempts. 

In the end, I used a different method of query generation, and it works 
better, or at least gives some results, whereas with moreLikeThis the results 
were poor or there were none at all. 

I should mention that my index was composed of short documents, so the 
intersection between the top-ranked TF-IDF terms was the empty set.  
MoreLikeThis works better when you have long documents. 

Yes, I've changed the thresholds for min TFIDF and max TFIDF, and other 
parameters. 

I've also used the mlt.maxqt parameter to increase the number of terms used in 
query generation, but it still didn't work well, since generating queries from 
the terms with the highest TF-IDF scores doesn't produce a representative query 
for the document.  I wasn't able to tune it. For a low value such as 
mlt.maxqt=3 or 4, results were poor, while for mlt.maxqt=5 or 6 it gave too 
many irrelevant results. 



Thank you,
Best Wishes,
Vitalie Scurtu
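
To make the term-selection idea in this exchange concrete, here is a rough sketch of ranking one document's terms by tf * idf with cut-offs. It is not Solr's or Lucene's actual code: the function name interesting_terms and the toy corpus are made up, the arguments min_tf, min_df and max_query_terms only loosely mirror mlt.mintf, mlt.mindf and mlt.maxqt, and the idf form 1 + log(N / (df + 1)) is an assumption about Lucene's classic similarity.

```python
import math
from collections import Counter


def interesting_terms(doc_tokens, corpus, min_tf=2, min_df=1, max_query_terms=5):
    """Rank the terms of one document by a simple tf * idf score.

    corpus is a list of token lists and should include the document itself.
    """
    n_docs = len(corpus)
    # Document frequency: number of corpus documents containing each term.
    df = Counter()
    for tokens in corpus:
        df.update(set(tokens))
    tf = Counter(doc_tokens)
    scored = []
    for term, freq in tf.items():
        # Skip terms below the minimum term or document frequency.
        if freq < min_tf or df[term] < min_df:
            continue
        idf = 1.0 + math.log(n_docs / (df[term] + 1))  # assumed idf form
        scored.append((term, freq * idf))
    # Highest score first; ties are broken alphabetically.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:max_query_terms]


corpus = [
    ["the", "canal", "the", "amsterdam", "canal"],
    ["the", "the", "weather"],
    ["the", "the", "train"],
]
ranked = interesting_terms(corpus[0], corpus)
print([term for term, _ in ranked])
```

In this toy corpus "the" occurs in every document, so "canal" outranks it even though both appear twice in the source document, while "amsterdam" is dropped by min_tf=2. That drop also hints at why short documents are hard: few of their terms reach the minimum term frequency.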




RE: Keyword extraction

2008-11-26 Thread Plaatje, Patrick

Hi Aleksander,

Thanks for clearing this up. I am confident that this is an avenue worth 
exploring, as I'm just starting to grasp the matter. Do you know why I'm not 
getting any results with the query posted earlier, then? It gives me only the 
following:

<lst name="moreLikeThis">
 <result name="18477975" numFound="0" start="0"/>
</lst>

Instead of delivering details of the interestingTerms.

Thanks in advance

Patrick
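
A small sketch for checking such a response programmatically: it reads the numFound value of each result inside the moreLikeThis section. The surrounding response element in SAMPLE is an assumption (only the lst fragment appears in this thread), and mlt_hit_counts is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# The <lst> fragment from the message above, wrapped in an assumed root element.
SAMPLE = """<response>
  <lst name="moreLikeThis">
    <result name="18477975" numFound="0" start="0"/>
  </lst>
</response>"""


def mlt_hit_counts(xml_text):
    """Map each source document id to the number of similar documents found."""
    root = ET.fromstring(xml_text)
    counts = {}
    for lst in root.iter("lst"):
        if lst.get("name") != "moreLikeThis":
            continue
        for result in lst.findall("result"):
            counts[result.get("name")] = int(result.get("numFound", "0"))
    return counts


print(mlt_hit_counts(SAMPLE))
```

A count of 0 means no similar documents were found for that id. It says nothing about interestingTerms, which would have to be looked for in a separate part of the response.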



Re: Keyword extraction

2008-11-26 Thread Aleksander M. Stensby


Hi there!
Well, first of all, I think you have an error in your query, if I'm not  
mistaken.

You say http://localhost:8080/solr/select/?q=id=18477975...
but since you are referring to the field called id, you must say:
http://localhost:8080/solr/select/?q=id:18477975...
(use a colon instead of the equals sign).
I think that will do the trick.
If not, try adding debugQuery=on at the end of your request URL to  
see debug output on how the query is parsed and if/how any documents are  
matched against your query.

Hope this helps.

Cheers,
 Aleksander
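
A brief sketch of the field:value form with the optional debug flag, assuming the same host and document id as in the example above; select_url is a hypothetical helper name.

```python
from urllib.parse import urlencode


def select_url(base, field, value, debug=False):
    """Build a /select URL that queries a single field as field:value."""
    params = {"q": f"{field}:{value}"}
    if debug:
        # Ask for debug output on how the query is parsed and matched.
        params["debugQuery"] = "on"
    # urlencode percent-encodes the colon as %3A; it is decoded again
    # on the receiving side, so the query is still id:18477975.
    return f"{base}/select?{urlencode(params)}"


print(select_url("http://localhost:8080/solr", "id", "18477975", debug=True))
```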



--
Aleksander M. Stensby
Senior software developer
Integrasco A/S
www.integrasco.no

RE: Keyword extraction

2008-11-26 Thread Plaatje, Patrick

Hi Aleksander,

This was a typo on my end; the original query included a colon instead of 
an equals sign. But I think it has to do with my field not being stored and not 
being defined with termVectors=true. I'm recreating the index now to see 
if this fixes the problem.

Best,

patrick


Re: Keyword extraction

2008-11-26 Thread Aleksander M. Stensby

I'm sure that for certain problems and cases you will need to do quite a  
bit of tweaking to make it work (to suit your needs), but I responded to  
your statement because you made it sound like the MoreLikeThis component  
does not work at all for its purpose, while it actually does work as  
intended and can be of great aid in constructing queries to retrieve  
same-topic documents etc.


- Aleksander

On Wed, 26 Nov 2008 14:10:57 +0100, Scurtu Vitalie [EMAIL PROTECTED]  
wrote:



Yes, I totally understand, and agree. 

MoreLikeThis uses TF-IDF to rank terms, then it generates queries based  
on top ranked terms.  In any case, I wasn't able to make it work after  
many attempts.


Finally, I've used a different method for queries generation, and it  
works better, or at least gives some results, while with moreLikeThis  
results were poor or no result at all.


To mention that my index was composed by short length documents,  
therefore the intersection between top ranked terms by TF-IDF was empty  
set.  MoreLikeThis works better when you have long documents.


Yes, I've changed the thresholds for min TFIDF and max TFIDF, and others  
parameters.


I've also used mlt.maxqt parameter  to increase the number of terms  
used in queries generation, but still didn't work well, since the method  
of queries generation based on terms with the highest TF-IDF score  
doesn't generate representative query for document.  I wasn't able to  
tune it. For a low value such as mlt.maxqt=3,4, results were poor, while  
for mlt.maxqt=5,6 it gave too many and irrelevant results.




Thank you,
Best Wishes,
Vitalie Scurtu



--- On Wed, 11/26/08, Aleksander M. Stensby  
[EMAIL PROTECTED] wrote:

From: Aleksander M. Stensby aleksander.
[EMAIL PROTECTED]
Subject: Re:  Keyword extraction
To: solr-user@lucene.apache.org
Date: Wednesday, November 26, 2008, 1:03 PM

I do not agree with you at all. The concept of MoreLikeThis is based on  
the

fundamental idea of TF-IDF weighting, and not term frequency alone.
Please take a look at:
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/similar/MoreLikeThis.html
As you can see, it is possible to use cut-off thresholds to significantly
reduce the number of unimportant terms, and generate highly suitable  
queries

based on the tf-idf frequency of the term, since as you point out, high
frequency terms alone tends to be useless for querying, but taking the  
document

frequency into account drastically increases the importance of the term!

In solr, use parameters to manipulate your desired results:
http://wiki.apache.org/solr/MoreLikeThis#head-6460069f297626f2a982f1e22ec5d1519c456b2c
For instance:
mlt.mintf - Minimum Term Frequency - the frequency below which terms  
will be

ignored in the source doc.
mlt.mindf - Minimum Document Frequency - the frequency at which words  
will be

ignored which do not occur in at least this many docs.
You can also set thresholds for term length etc.

Hope this gives you a better idea of things.
- Aleks

On Wed, 26 Nov 2008 12:38:38 +0100, Scurtu Vitalie [EMAIL PROTECTED]
wrote:


Dear Patrick, I had the same problem with the MoreLikeThis function.

After briefly reading and analyzing the source code of the moreLikeThis
function in Solr, I concluded:

MoreLikeThis uses term vectors to rank all the terms from a document
by their frequency. According to this ranking, it will start to generate
queries, artificially, and search for documents.

So, moreLikeThis will retrieve related documents by artificially
generating queries based on the most frequent terms.

There's a big problem with the most frequent terms from documents. The most
frequent words are usually meaningless: so-called function words, or, as
people from Information Retrieval like to call them, stopwords.
However, ignoring the technical problems of the moreLikeThis implementation,
this approach is very dangerous, since queries are generated
artificially based on a given document.
Writing queries for retrieving a document is a human task, and it assumes
some knowledge (the user knows what document he wants).

I advise using other approaches, depending on your expectations. For
example, you can extract similar documents just by searching for documents
with a similar title (more like this doesn't work in this case).


I hope it helps,
Best Regards,
Vitalie Scurtu
--- On Wed, 11/26/08, Plaatje, Patrick [EMAIL PROTECTED] wrote:

From: Plaatje, Patrick [EMAIL PROTECTED]
Subject: RE:  Keyword extraction
To: solr-user@lucene.apache.org
Date: Wednesday, November 26, 2008, 10:52 AM

Hi All,
as an addition to my previous post, no interestingTerms are returned
when I execute the following URL:

http://localhost:8080/solr/select/?q=id=18477975&mlt.fl=text&mlt.interestingTerms=list&mlt=true&mlt.match.include=true

I get a moreLikeThis list though, any thoughts?
Best,
Patrick

--
Aleksander M. Stensby
Senior software developer
Integrasco A/S
www.integrasco.no

RE: Keyword extraction

2008-11-26 Thread Scurtu Vitalie

Dear Patrick, I had the same problem with the MoreLikeThis function.

After briefly reading and analyzing the source code of the moreLikeThis
function in Solr, I concluded:

MoreLikeThis uses term vectors to rank all the terms from a document
by their frequency. According to this ranking, it will start to generate
queries, artificially, and search for documents.

So, moreLikeThis will retrieve related documents by artificially
generating queries based on the most frequent terms.

There's a big problem with the most frequent terms from documents. The most
frequent words are usually meaningless: so-called function words, or, as
people from Information Retrieval like to call them, stopwords.
However, ignoring the technical problems of the moreLikeThis implementation,
this approach is very dangerous, since queries are generated
artificially based on a given document.
Writing queries for retrieving a document is a human task, and it assumes
some knowledge (the user knows what document he wants).

I advise using other approaches, depending on your expectations. For
example, you can extract similar documents just by searching for documents
with a similar title (more like this doesn't work in this case).

I hope it helps,
Best Regards,
Vitalie Scurtu
--- On Wed, 11/26/08, Plaatje, Patrick [EMAIL PROTECTED] wrote:
From: Plaatje, Patrick [EMAIL PROTECTED]
Subject: RE:  Keyword extraction
To: solr-user@lucene.apache.org
Date: Wednesday, November 26, 2008, 10:52 AM

Hi All,

as an addition to my previous post, no interestingTerms are returned
when I execute the following URL:

http://localhost:8080/solr/select/?q=id=18477975&mlt.fl=text&mlt.interestingTerms=list&mlt=true&mlt.match.include=true

I get a moreLikeThis list though, any thoughts?

Best,

Patrick

Re: Keyword extraction

2008-11-26 Thread Aleksander M. Stensby

Ah, yes, that is important. In Lucene, MLT checks whether the term vector
is stored; if it is not, it can still perform the query, but in a much,
much less efficient way. Lucene will re-analyze the document (and the
variable DEFAULT_MAX_NUM_TOKENS_PARSED will be used to limit the number of
tokens that will be parsed). (I don't want to go into details on this since
I haven't really dug through the code :p) But when the field isn't stored
either, it is rather difficult to re-analyze the document ;)
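
To make the fallback described above concrete, here is a minimal Python sketch. It is not Lucene's code: the dict-based document record, the name term_frequencies, the regex tokenizer and the cap value are all assumptions made for illustration. It only shows the order of the three cases: use the term vector if present, otherwise re-analyze the stored text up to a token limit, otherwise give up.

```python
import re
from collections import Counter

# Hypothetical cap, standing in for DEFAULT_MAX_NUM_TOKENS_PARSED.
MAX_TOKENS_PARSED = 5000


def term_frequencies(doc, field, max_tokens=MAX_TOKENS_PARSED):
    """Return term -> frequency for one field of a toy document record.

    `doc` is an assumed structure:
    {"term_vectors": {field: {term: freq}}, "stored": {field: "raw text"}}
    """
    vector = doc.get("term_vectors", {}).get(field)
    if vector is not None:
        # Cheap path: frequencies were computed at index time.
        return Counter(vector)

    text = doc.get("stored", {}).get(field)
    if text is None:
        # Neither term vector nor stored value: nothing to re-analyze.
        return Counter()

    # Expensive path: tokenize the stored text again, up to the cap.
    tokens = re.findall(r"\w+", text.lower())[:max_tokens]
    return Counter(tokens)


if __name__ == "__main__":
    doc = {"stored": {"text": "Amsterdam canals amsterdam"}}
    print(term_frequencies(doc, "text"))
```

A field that is neither stored nor indexed with term vectors ends up in the last branch, which matches the situation where no interesting terms can be produced.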


On a general note, if you want to really understand how the MLT works,  
take a look at the wiki or read this thorough blog post:  
http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/


Regards,
 Aleksander

On Wed, 26 Nov 2008 14:41:52 +0100, Plaatje, Patrick  
[EMAIL PROTECTED] wrote:



Hi Aleksander,

This was a typo on my end, the original query included a semicolon  
instead of an equal sign. But I think it has to do with my field not  
being stored and not being identified as termVectors=true. I'm  
recreating the index now, and see if this fixes the problem.


Best,

patrick

-Original Message-
From: Aleksander M. Stensby [mailto:[EMAIL PROTECTED]
Sent: woensdag 26 november 2008 14:37
To: solr-user@lucene.apache.org
Subject: Re: Keyword extraction

Hi there!
Well, first of all I think you have an error in your query, if I'm not
mistaken.

You say http://localhost:8080/solr/select/?q=id=18477975...
but since you are referring to the field called id, you must say:
http://localhost:8080/solr/select/?q=id:18477975...
(use a colon instead of the equals sign).
I think that will do the trick.
If not, try adding debugQuery=on at the end of your request URL, to
see debug output on how the query is parsed and if/how any documents are
matched against your query.
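
As a small illustration of the corrected request, the sketch below assembles the URL in Python so that the parameters are separated and escaped properly. The helper name build_mlt_url is made up for this example; the host, document id and parameter names are the ones from this thread, and whether a given handler honors every parameter depends on the Solr version and configuration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_mlt_url(base, doc_id, field="text"):
    """Build a MoreLikeThis request URL (hypothetical helper)."""
    params = [
        ("q", f"id:{doc_id}"),             # field:value, with a colon
        ("mlt", "true"),
        ("mlt.fl", field),
        ("mlt.interestingTerms", "list"),
        ("mlt.match.include", "true"),
        ("debugQuery", "on"),              # shows how the query was parsed
    ]
    # urlencode joins the pairs with '&' and percent-escapes the values.
    return f"{base}?{urlencode(params)}"


if __name__ == "__main__":
    url = build_mlt_url("http://localhost:8080/solr/select/", 18477975)
    print(url)
    # Decode again to confirm the query reached the 'q' parameter intact.
    print(parse_qs(urlsplit(url).query)["q"])
```

Building the query string from pairs avoids both the "=" versus ":" mix-up and parameters running into each other.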

Hope this helps.

Cheers,
  Aleksander



On Wed, 26 Nov 2008 13:08:30 +0100, Plaatje, Patrick  
[EMAIL PROTECTED] wrote:



Hi Aleksander,

Thanks for clearing this up. I am confident that this is a way to
explore for me as I'm just starting to grasp the matter. Do you know
why I'm not getting any results with the query posted earlier then? It
gives me only the following:

<lst name="moreLikeThis">
<result name="18477975" numFound="0" start="0"/>
</lst>

Instead of delivering details of the interestingTerms.

Thanks in advance

Patrick



Re: Keyword extraction

2008-11-26 Thread Jeff Newburn

Unfortunately, as it stands, neither interestingTerms nor debugQuery
explains why Solr chose the matches it did for moreLikeThis. There is
currently a task in JIRA to try to add that information to debugQuery.

The ticket can be found here: https://issues.apache.org/jira/browse/SOLR-860

-Jeff


On 11/26/08 5:41 AM, Plaatje, Patrick [EMAIL PROTECTED]
wrote:

 Hi Aleksander,
 
 This was a typo on my end, the original query included a semicolon instead of
 an equal sign. But I think it has to do with my field not being stored and not
 being identified as termVectors=true. I'm recreating the index now, and see
 if this fixes the problem.
 
 Best,
 
 patrick
 
 -Original Message-
 From: Aleksander M. Stensby [mailto:[EMAIL PROTECTED]
 Sent: woensdag 26 november 2008 14:37
 To: solr-user@lucene.apache.org
 Subject: Re: Keyword extraction
 
 Hi there!
 Well, first of all I think you have an error in your query, if I'm not
 mistaken.
 You say http://localhost:8080/solr/select/?q=id=18477975...
 but since you are referring to the field called id, you must say:
 http://localhost:8080/solr/select/?q=id:18477975...
 (use a colon instead of the equals sign).
 I think that will do the trick.
 If not, try adding debugQuery=on at the end of your request URL, to see
 debug output on how the query is parsed and if/how any documents are matched
 against your query.
 Hope this helps.
 
 Cheers,
   Aleksander
 
 
 
 On Wed, 26 Nov 2008 13:08:30 +0100, Plaatje, Patrick
 [EMAIL PROTECTED] wrote:
 
 Hi Aleksander,
 
 Thanks for clearing this up. I am confident that this is a way to
 explore for me as I'm just starting to grasp the matter. Do you know
 why I'm not getting any results with the query posted earlier then? It
 gives me only the following:
 
 <lst name="moreLikeThis">
 <result name="18477975" numFound="0" start="0"/>
 </lst>
 
 Instead of delivering details of the interestingTerms.
 
 Thanks in advance
 
 Patrick
 
 

Re: Keyword extraction

2008-11-26 Thread Scurtu Vitalie

Sorry for not writing clearly.

Yes, it works well for its purpose, and I didn't want to say that the
moreLikeThis component does not work at all.

At the same time, it's good to know the limitations and problems of the
moreLikeThis function.

What I want to point out is that query generation is one of the fundamental
problems in Information Retrieval, and, independent of the implementation of
the moreLikeThis function, it can give inappropriate results.

Best Wishes,
Vitalie Scurtu


--- On Wed, 11/26/08, Aleksander M. Stensby [EMAIL PROTECTED] wrote:
From: Aleksander M. Stensby [EMAIL PROTECTED]
Subject: Re:  Keyword extraction
To: solr-user@lucene.apache.org
Date: Wednesday, November 26, 2008, 2:43 PM


I'm sure that for certain problems and cases you will need to do quite a bit
of tweaking to make it work (to suit your needs), but I responded to your
statement because you made it sound like the MoreLikeThis component does not
work at all for its purpose, while it actually does work as intended and can
be of great aid in constructing queries to retrieve same-topic documents etc.

- Aleksander

On Wed, 26 Nov 2008 14:10:57 +0100, Scurtu Vitalie [EMAIL PROTECTED]
wrote:

 Yes, I totally understand, and agree.
 
 MoreLikeThis uses TF-IDF to rank terms, then it generates queries based on
 the top-ranked terms. In any case, I wasn't able to make it work after many
 attempts.
 
 Finally, I used a different method for query generation, and it works
 better, or at least gives some results, while with moreLikeThis the results
 were poor or there were no results at all.
 
 I should mention that my index was composed of short documents, so the
 intersection between the top-ranked terms by TF-IDF was the empty set.
 MoreLikeThis works better when you have long documents.
 
 Yes, I changed the thresholds for min TFIDF and max TFIDF, and
 other parameters.
 
 I also used the mlt.maxqt parameter to increase the number of terms used in
 query generation, but it still didn't work well, since generating queries
 from the terms with the highest TF-IDF score doesn't produce a
 representative query for a document. I wasn't able to tune it. For a low
 value such as mlt.maxqt=3,4, results were poor, while for mlt.maxqt=5,6 it
 gave too many irrelevant results.
 
 
 
 Thank you,
 Best Wishes,
 Vitalie Scurtu
 
 
 

Re: Keyword extraction

2008-11-26 Thread Aleksander M. Stensby

I do not agree with you at all. The concept of MoreLikeThis is based on the
fundamental idea of TF-IDF weighting, and not term frequency alone.
Please take a look at:
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/similar/MoreLikeThis.html
As you can see, it is possible to use cut-off thresholds to significantly
reduce the number of unimportant terms, and generate highly suitable queries
based on the TF-IDF weight of the term. As you point out, high-frequency
terms alone tend to be useless for querying, but taking the document
frequency into account drastically increases the importance of the term!

In Solr, use parameters to manipulate your desired results:
http://wiki.apache.org/solr/MoreLikeThis#head-6460069f297626f2a982f1e22ec5d1519c456b2c

For instance:
mlt.mintf - Minimum Term Frequency - the frequency below which terms will be
ignored in the source doc.
mlt.mindf - Minimum Document Frequency - the frequency at which words will be
ignored which do not occur in at least this many docs.

You can also set thresholds for term length etc.
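
To illustrate how such thresholds interact with TF-IDF ranking, here is a rough Python sketch. It is not the Lucene or Solr implementation: the function name interesting_terms, the toy corpus and the default values are assumptions, and the idf formula (log(N / (df + 1)) + 1) is the classic form as I remember it, so treat it as an assumption too. The min_tf, min_df and max_terms arguments play the roles of mlt.mintf, mlt.mindf and mlt.maxqt.

```python
import math
from collections import Counter


def interesting_terms(source_tokens, corpus, min_tf=2, min_df=5,
                      max_terms=25, min_word_len=0):
    """Rank the terms of one document by TF-IDF after applying cut-offs.

    source_tokens: list of tokens of the source document
    corpus: list of token lists, one per indexed document (toy stand-in)
    """
    n_docs = len(corpus)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for doc in corpus:
        df.update(set(doc))

    tf = Counter(source_tokens)
    scored = []
    for term, freq in tf.items():
        # Thresholds: drop terms that are rare in the source document,
        # rare in the corpus, or too short.
        if freq < min_tf or df[term] < min_df or len(term) < min_word_len:
            continue
        idf = math.log(n_docs / (df[term] + 1)) + 1  # assumed idf form
        scored.append((term, freq * idf))

    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:max_terms]


if __name__ == "__main__":
    corpus = [
        ["the", "solr", "index"],
        ["the", "lucene", "index"],
        ["the", "solr", "query"],
        ["the", "cat"],
    ]
    source = ["the", "the", "the", "solr", "solr", "index"]
    for term, score in interesting_terms(source, corpus, min_tf=2, min_df=2):
        print(term, round(score, 3))
```

In this toy run "the" occurs more often in the source document than "solr", yet "solr" scores higher because it appears in fewer documents, which is the point about document frequency made above.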

Hope this gives you a better idea of things.
- Aleks

On Wed, 26 Nov 2008 12:38:38 +0100, Scurtu Vitalie [EMAIL PROTECTED]  
wrote:



Dear Patrick, I had the same problem with the MoreLikeThis function.

After briefly reading and analyzing the source code of the moreLikeThis
function in Solr, I concluded:

MoreLikeThis uses term vectors to rank all the terms from a document
by their frequency. According to this ranking, it will start to generate
queries, artificially, and search for documents.

So, moreLikeThis will retrieve related documents by artificially
generating queries based on the most frequent terms.

There's a big problem with the most frequent terms from documents. The most
frequent words are usually meaningless: so-called function words, or, as
people from Information Retrieval like to call them, stopwords.
However, ignoring the technical problems of the moreLikeThis implementation,
this approach is very dangerous, since queries are generated
artificially based on a given document.
Writing queries for retrieving a document is a human task, and it assumes
some knowledge (the user knows what document he wants).

I advise using other approaches, depending on your expectations. For
example, you can extract similar documents just by searching for documents
with a similar title (more like this doesn't work in this case).


I hope it helps,
Best Regards,
Vitalie Scurtu
--- On Wed, 11/26/08, Plaatje, Patrick [EMAIL PROTECTED]  
wrote:

From: Plaatje, Patrick [EMAIL PROTECTED]
Subject: RE:  Keyword extraction
To: solr-user@lucene.apache.org
Date: Wednesday, November 26, 2008, 10:52 AM

Hi All,
as an addition to my previous post, no interestingTerms are returned
when I execute the following URL:

http://localhost:8080/solr/select/?q=id=18477975&mlt.fl=text&mlt.interestingTerms=list&mlt=true&mlt.match.include=true

I get a moreLikeThis list though, any thoughts?
Best,
Patrick

--
Aleksander M. Stensby
Senior software developer
Integrasco A/S
www.integrasco.no

Re: Keyword extraction

2008-11-26 Thread Shalin Shekhar Mangar

You might also be interested in
http://wiki.apache.org/solr/TermVectorComponent

On Wed, Nov 26, 2008 at 12:39 AM, Plaatje, Patrick 
[EMAIL PROTECTED] wrote:

 Hi all,

 Struggling with a question I recently got from a colleague: is it possible
 to extract keywords from indexed content?

 In my opinion it should be possible to find out for which words the
 ranking of the indexed content is the highest (Lucene or Solr), but I have
 no clue where to begin. Does anyone have suggestions?

 Best,

 Patrick




-- 
Regards,
Shalin Shekhar Mangar.

Re: Keyword extraction

2008-11-25 Thread Ryan McKinley


Lots of approaches out there...

The easiest off-the-shelf method would be to use the
MoreLikeThisHandler and get the top interesting terms:

http://wiki.apache.org/solr/MoreLikeThisHandler

ryan


On Nov 25, 2008, at 2:09 PM, Plaatje, Patrick wrote:


Hi all,

Struggling with a question I recently got from a colleague: is it possible
to extract keywords from indexed content?

In my opinion it should be possible to find out for which words the
ranking of the indexed content is the highest (Lucene or Solr), but I have
no clue where to begin. Does anyone have suggestions?

Best,

Patrick
