subject:"Selection of terms for MoreLikeThis"

Re: Selection of terms for MoreLikeThis

2009-11-13 Thread Andrew Clegg

Any ideas on this? Is it worth sending a bug report?

Those links are live, by the way, in case anyone wants to verify that MLT is
returning suggestions with very low tf.idf.

Cheers,

Andrew.

Andrew Clegg wrote:

Hi,

If I run a MoreLikeThis query like the following:

http://www.cathdb.info/solr/mlt?q=id:3.40.50.720&rows=0&mlt.interestingTerms=list&mlt.match.include=false&mlt.fl=keywords&mlt.mintf=1&mlt.mindf=1

one of the terms in the results is "and" (I don't do any stopword removal
on this field).
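
For anyone wanting to reproduce the request, here is a minimal sketch of assembling it in Python. It assumes the parameters are joined with `&` as in any ordinary query string; the host and parameter values are the ones from the query above, and nothing here is specific to Solr beyond the parameter names.

```python
from urllib.parse import urlencode

# Handler URL and parameter values as given in the message.
BASE = "http://www.cathdb.info/solr/mlt"

params = [
    ("q", "id:3.40.50.720"),
    ("rows", "0"),
    ("mlt.interestingTerms", "list"),
    ("mlt.match.include", "false"),
    ("mlt.fl", "keywords"),
    ("mlt.mintf", "1"),
    ("mlt.mindf", "1"),
]

# safe=":" keeps the colon in "id:..." readable instead of encoding it as %3A.
url = BASE + "?" + urlencode(params, safe=":")
print(url)
```

Switching `mlt.interestingTerms` to `details` only needs a change to the one tuple.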

However if I look inside that document with the TermVectorComponent:

http://www.cathdb.info/solr/select/?q=id:3.40.50.720&tv=true&tv.all=true&tv.fl=keywords

I see that "and" has a measly tf.idf of 7.46E-4. But there are other terms
with *much* higher tf.idf scores, e.g.:

<lst name="aquaspirillum">
 <int name="tf">1</int>
 <int name="df">10</int>
 <double name="tf-idf">0.1</double>
</lst>

that *don't* appear in the MoreLikeThis list. (I tried adding
mlt.maxwl=999 to the end of the MLT query but it makes no difference.)

What's going on? Surely something with tf.idf = 0.1 is a far better
candidate for a MoreLikeThis query than something with tf.idf = 1.46E-4?
Or does MoreLikeThis do some other heuristic magic to select good
candidates, and sometimes get it wrong?
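
One possible reading of the discrepancy, offered as a sketch rather than a confirmed diagnosis: the term-vector sample above is consistent with a tf-idf computed as plain tf/df (1/10 = 0.1), whereas MoreLikeThis is assumed here to weight terms by tf * (1 + ln(numDocs / (df + 1))), in the style of Lucene's classic idf. The index size and the tf and df figures for "and" below are hypothetical, picked only so that tf/df comes out near 7.46E-4; the function names are made up for illustration.

```python
import math


def tv_weight(tf, df):
    """Plain tf/df, which matches the sample above (tf=1, df=10 -> 0.1)."""
    return tf / df


def mlt_weight(tf, df, num_docs):
    """Assumed MLT-style weight: tf * (1 + ln(num_docs / (df + 1)))."""
    return tf * (1.0 + math.log(num_docs / (df + 1)))


NUM_DOCS = 300_000  # hypothetical index size

# (tf, df) per term; only the aquaspirillum figures come from the thread.
terms = {
    "aquaspirillum": (1, 10),
    "and": (40, 53_600),  # hypothetical: 40 / 53600 is roughly 7.46E-4
}

for name, (tf, df) in terms.items():
    print(name, tv_weight(tf, df), mlt_weight(tf, df, NUM_DOCS))
```

Under these assumptions the two measures rank the terms in opposite order: tf/df favours the rare term, while the tf-scaled weight favours a common term that occurs many times in the document.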

BTW the keywords field is indexed, stored, multi-valued and term-vectored.

Thanks,

Andrew.

--
:: http://biotext.org.uk/ ::

--
View this message in context:
http://old.nabble.com/Selection-of-terms-for-MoreLikeThis-tp26286005p26335061.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Selection of terms for MoreLikeThis

2009-11-13 Thread Chantal Ackermann

Hi Andrew,

No idea, I'm afraid, but could you send the output of
interestingTerms=details?
That would at least show what MoreLikeThis uses, in comparison to the
TermVectorComponent output you've already pasted.

Chantal


Re: Selection of terms for MoreLikeThis

2009-11-13 Thread Andrew Clegg



Chantal Ackermann wrote:
 
 No idea, I'm afraid, but could you send the output of
 interestingTerms=details?
 That would at least show what MoreLikeThis uses, in comparison to the
 TermVectorComponent output you've already pasted.
 

I can, but I'm afraid they're not very illuminating!

http://www.cathdb.info/solr/mlt?q=id:3.40.50.720&rows=0&mlt.interestingTerms=details&mlt.match.include=false&mlt.fl=keywords&mlt.mintf=1&mlt.mindf=1

<response>
<lst name="responseHeader">
 <int name="status">0</int>
 <int name="QTime">59</int>
</lst>
<result name="response" numFound="280227" start="0"/>
<lst name="interestingTerms">
 <float name="keywords:dehydrogenase">1.0</float>
 <float name="keywords:reductase">1.0</float>
 <float name="keywords:metabolism">1.0</float>
 <float name="keywords:activity">1.0</float>
 <float name="keywords:process">1.0</float>
 <float name="keywords:alcohol">1.0</float>
 <float name="keywords:and">1.0</float>
 <float name="keywords:malate">1.0</float>
 <float name="keywords:biosynthesis">1.0</float>
 <float name="keywords:biosynthetic">1.0</float>
 <float name="keywords:degradation">1.0</float>
 <float name="keywords:precursor">1.0</float>
 <float name="keywords:metabolic">1.0</float>
 <float name="keywords:protein">1.0</float>
 <float name="keywords:synthase">1.0</float>
 <float name="keywords:acid">1.0</float>
 <float name="keywords:enzyme">1.0</float>
 <float name="keywords:succinyl-coa">1.0</float>
 <float name="keywords:putative">1.0</float>
 <float name="keywords:(nadp+)">1.0</float>
 <float name="keywords:4,6-dehydratase">1.0</float>
 <float name="keywords:fatty">1.0</float>
 <float name="keywords:chloroplast">1.0</float>
 <float name="keywords:lactobacillus">1.0</float>
 <float name="keywords:glyoxylate">1.0</float>
</lst>
</response>
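
To compare this list against the term vectors programmatically, a small sketch like the following could pull the field, term and weight out of the response. It assumes the response is Solr's usual XML, with angle brackets and quoted attributes; the embedded sample is a shortened copy of the output above, and `interesting_terms` is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the response above, for illustration only.
SAMPLE = """<response>
<lst name="interestingTerms">
 <float name="keywords:dehydrogenase">1.0</float>
 <float name="keywords:and">1.0</float>
 <float name="keywords:glyoxylate">1.0</float>
</lst>
</response>"""


def interesting_terms(xml_text):
    """Return (field, term, weight) tuples from the interestingTerms list."""
    root = ET.fromstring(xml_text)
    found = []
    for lst in root.findall("lst"):
        if lst.get("name") != "interestingTerms":
            continue
        for node in lst.findall("float"):
            # Names look like "keywords:and"; split on the first colon only.
            field, _, term = node.get("name").partition(":")
            found.append((field, term, float(node.text)))
    return found


for field, term, weight in interesting_terms(SAMPLE):
    print(field, term, weight)
```

Every weight in the output above is 1.0, so the parsed list only confirms which terms were picked, not how they were ranked.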

Cheers,

Andrew.

-- 
View this message in context: 
http://old.nabble.com/Selection-of-terms-for-MoreLikeThis-tp26286005p26336558.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Selection of terms for MoreLikeThis

2009-11-13 Thread Chantal Ackermann


Hi Andrew,

Your URL does not include the parameter mlt.boost. Setting that to
true made a noticeable difference for my queries.


If that doesn't help, there is also the parameter
 mlt.minwl
the minimum word length below which words will be ignored.

All your other terms seem to be longer than 3 characters, so it should
help in this case. But it seems a bit like a workaround.
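
To make the effect of a minimum word length concrete, here is a client-side sketch of the rule as described above (words shorter than the threshold are ignored). It is only an illustration of the idea, not Solr's implementation; `drop_short_terms` is a made-up name and the term list is a handful of entries from the interestingTerms output.

```python
def drop_short_terms(terms, min_word_len):
    """Keep only terms with at least min_word_len characters."""
    return [t for t in terms if len(t) >= min_word_len]


# A few of the terms MLT returned for the document.
terms = ["dehydrogenase", "reductase", "and", "malate", "acid"]

# With a threshold of 4, the three-letter "and" is the only term dropped.
print(drop_short_terms(terms, 4))
```

The request-side equivalent would be adding `mlt.minwl=4` to the query parameters.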


Cheers,
Chantal


Selection of terms for MoreLikeThis

2009-11-10 Thread Andrew Clegg


Hi,

If I run a MoreLikeThis query like the following:

http://www.cathdb.info/solr/mlt?q=id:3.40.50.720&rows=0&mlt.interestingTerms=list&mlt.match.include=false&mlt.fl=keywords&mlt.mintf=1&mlt.mindf=1

one of the terms in the results is "and" (I don't do any stopword removal on
this field).

However if I look inside that document with the TermVectorComponent:

http://www.cathdb.info/solr/select/?q=id:3.40.50.720&tv=true&tv.all=true&tv.fl=keywords

I see that "and" has a measly tf.idf of 7.46E-4. But there are other terms
with *much* higher tf.idf scores, e.g.:

<lst name="aquaspirillum">
 <int name="tf">1</int>
 <int name="df">10</int>
 <double name="tf-idf">0.1</double>
</lst>

that *don't* appear in the MoreLikeThis list. (I tried adding mlt.maxwl=999
to the end of the MLT query but it makes no difference.)

What's going on? Surely something with tf.idf = 0.1 is a far better
candidate for a MoreLikeThis query than something with tf.idf = 1.46E-4? Or
does MoreLikeThis do some other heuristic magic to select good candidates,
and sometimes get it wrong?

BTW the keywords field is indexed, stored, multi-valued and term-vectored.

Thanks,

Andrew.

-- 
:: http://biotext.org.uk/ ::

-- 
View this message in context: 
http://old.nabble.com/Selection-of-terms-for-MoreLikeThis-tp26286005p26286005.html
Sent from the Solr - User mailing list archive at Nabble.com.