Re: Selection of terms for MoreLikeThis
Any ideas on this? Is it worth sending a bug report?

Those links are live, by the way, in case anyone wants to verify that MLT is returning suggestions with very low tf.idf.

Cheers,
Andrew.

--
View this message in context: http://old.nabble.com/Selection-of-terms-for-MoreLikeThis-tp26286005p26335061.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Selection of terms for MoreLikeThis
Hi Andrew,

no idea, I'm afraid, but could you send the output of mlt.interestingTerms=details? That would at least show what MoreLikeThis uses, in comparison to the TermVectorComponent output you've already pasted.

Chantal
Re: Selection of terms for MoreLikeThis
Chantal Ackermann wrote:

    no idea, I'm afraid - but could you send the output of interestingTerms=details? This at least would show what MoreLikeThis uses, in comparison to the TermVectorComponent you've already pasted.

I can, but I'm afraid they're not very illuminating!

http://www.cathdb.info/solr/mlt?q=id:3.40.50.720&rows=0&mlt.interestingTerms=details&mlt.match.include=false&mlt.fl=keywords&mlt.mintf=1&mlt.mindf=1

    <response>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">59</int>
      </lst>
      <result name="response" numFound="280227" start="0"/>
      <lst name="interestingTerms">
        <float name="keywords:dehydrogenase">1.0</float>
        <float name="keywords:reductase">1.0</float>
        <float name="keywords:metabolism">1.0</float>
        <float name="keywords:activity">1.0</float>
        <float name="keywords:process">1.0</float>
        <float name="keywords:alcohol">1.0</float>
        <float name="keywords:and">1.0</float>
        <float name="keywords:malate">1.0</float>
        <float name="keywords:biosynthesis">1.0</float>
        <float name="keywords:biosynthetic">1.0</float>
        <float name="keywords:degradation">1.0</float>
        <float name="keywords:precursor">1.0</float>
        <float name="keywords:metabolic">1.0</float>
        <float name="keywords:protein">1.0</float>
        <float name="keywords:synthase">1.0</float>
        <float name="keywords:acid">1.0</float>
        <float name="keywords:enzyme">1.0</float>
        <float name="keywords:succinyl-coa">1.0</float>
        <float name="keywords:putative">1.0</float>
        <float name="keywords:(nadp+)">1.0</float>
        <float name="keywords:4,6-dehydratase">1.0</float>
        <float name="keywords:fatty">1.0</float>
        <float name="keywords:chloroplast">1.0</float>
        <float name="keywords:lactobacillus">1.0</float>
        <float name="keywords:glyoxylate">1.0</float>
      </lst>
    </response>

Cheers,
Andrew.

--
View this message in context: http://old.nabble.com/Selection-of-terms-for-MoreLikeThis-tp26286005p26336558.html
Sent from the Solr - User mailing list archive at Nabble.com.
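For anyone who wants to inspect such a response programmatically, here is a minimal Python sketch. It assumes the response has the XML shape shown in this message; the SAMPLE string is a shortened, hand-written excerpt with three of the terms, and the helper name interesting_terms is made up for illustration. It only extracts the term weights and checks whether they are all identical, which is the case in the output above and is why the details view says nothing about ranking here.

```python
import xml.etree.ElementTree as ET

# Shortened excerpt of the response shape shown in the post (three terms only).
SAMPLE = """<response>
  <lst name="interestingTerms">
    <float name="keywords:dehydrogenase">1.0</float>
    <float name="keywords:and">1.0</float>
    <float name="keywords:glyoxylate">1.0</float>
  </lst>
</response>"""


def interesting_terms(xml_text):
    """Return {term: weight} from an MLT response (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    terms = {}
    for node in root.findall("./lst[@name='interestingTerms']/float"):
        # Names look like "field:term"; split on the first colon only.
        _field, _, term = node.get("name").partition(":")
        terms[term] = float(node.text)
    return terms


terms = interesting_terms(SAMPLE)
# True when every term carries the same weight, i.e. no ranking is visible.
uniform = len(set(terms.values())) == 1
print(terms, uniform)
```

If the weights come back non-uniform after changing the query parameters, sorting the dictionary by value gives the order in which MoreLikeThis favours the terms.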
Re: Selection of terms for MoreLikeThis
Hi Andrew,

your URL does not include the parameter mlt.boost. Setting that to true made a noticeable difference for my queries.

If that doesn't help, there is also the parameter mlt.minwl: the minimum word length below which words will be ignored. All your other terms seem to be longer than 3 characters, so it might help in this case? But that seems a bit like a workaround.

Cheers,
Chantal
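A small Python sketch of how the suggested parameters could be added to the query from this thread. The base URL and the existing parameters are taken from Andrew's message; mlt.boost and mlt.minwl are the two parameters suggested above. The function name build_mlt_url and the value 4 for mlt.minwl are assumptions for illustration (4 is simply the smallest length that would exclude the three-letter term "and").

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE = "http://www.cathdb.info/solr/mlt"  # handler URL from the thread


def build_mlt_url(doc_id, boost=True, min_word_len=4):
    """Build the MLT query URL with mlt.boost and mlt.minwl added.

    min_word_len=4 is an illustrative choice, not a recommendation.
    """
    params = [
        ("q", f"id:{doc_id}"),
        ("rows", 0),
        ("mlt.interestingTerms", "details"),
        ("mlt.match.include", "false"),
        ("mlt.fl", "keywords"),
        ("mlt.mintf", 1),
        ("mlt.mindf", 1),
        ("mlt.boost", "true" if boost else "false"),
        ("mlt.minwl", min_word_len),
    ]
    # urlencode percent-escapes reserved characters such as ":" in the query.
    return BASE + "?" + urlencode(params)


url = build_mlt_url("3.40.50.720")
print(url)
```

Building the query string with urlencode also avoids losing the "&" separators when the URL is pasted into a mail client.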
Selection of terms for MoreLikeThis
Hi,

If I run a MoreLikeThis query like the following:

http://www.cathdb.info/solr/mlt?q=id:3.40.50.720&rows=0&mlt.interestingTerms=list&mlt.match.include=false&mlt.fl=keywords&mlt.mintf=1&mlt.mindf=1

one of the hits in the results is "and" (I don't do any stopword removal on this field).

However, if I look inside that document with the TermVectorComponent:

http://www.cathdb.info/solr/select/?q=id:3.40.50.720&tv=true&tv.all=true&tv.fl=keywords

I see that "and" has a measly tf.idf of 7.46E-4. But there are other terms with *much* higher tf.idf scores, e.g.:

    <lst name="aquaspirillum">
      <int name="tf">1</int>
      <int name="df">10</int>
      <double name="tf-idf">0.1</double>
    </lst>

that *don't* appear in the MoreLikeThis list. (I tried adding mlt.maxwl=999 to the end of the MLT query but it makes no difference.)

What's going on? Surely something with tf.idf = 0.1 is a far better candidate for a MoreLikeThis query than something with tf.idf = 1.46E-4? Or does MoreLikeThis do some other heuristic magic to select good candidates, and sometimes get it wrong?

BTW the keywords field is indexed, stored, multi-valued and term-vectored.

Thanks,
Andrew.

--
:: http://biotext.org.uk/ ::

--
View this message in context: http://old.nabble.com/Selection-of-terms-for-MoreLikeThis-tp26286005p26286005.html
Sent from the Solr - User mailing list archive at Nabble.com.
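One possible explanation, sketched here under stated assumptions rather than confirmed from the source code: the tf-idf value in the TermVectorComponent output above is consistent with a plain tf/df ratio (1/10 = 0.1), whereas Lucene's MoreLikeThis, as far as I know, ranks terms by tf times a logarithmic idf of the form ln(numDocs / (df + 1)) + 1. Under a logarithmic idf, a very common term with a high tf can outrank a rare term with tf = 1. In the Python sketch below, the tf and df for "and" are hypothetical values chosen only so that tf/df comes out near 7.46E-4, and NUM_DOCS is an assumed index size; only the aquaspirillum numbers come from the post.

```python
import math


def tvc_tf_idf(tf, df):
    """Plain tf/df ratio, matching the numbers in the post (1/10 = 0.1)."""
    return tf / df


def mlt_style_score(tf, df, num_docs):
    """Assumed MoreLikeThis-style score: tf * (ln(num_docs / (df + 1)) + 1)."""
    return tf * (math.log(num_docs / (df + 1)) + 1.0)


NUM_DOCS = 280227  # assumption: total number of documents in the index

terms = {
    "aquaspirillum": (1, 10),  # tf and df as reported in the post
    "and": (100, 134000),      # hypothetical tf and df, tf/df is about 7.46E-4
}


def rank(score):
    """Return term names ordered from highest to lowest score."""
    return sorted(terms, key=lambda t: score(*terms[t]), reverse=True)


by_ratio = rank(tvc_tf_idf)
by_mlt = rank(lambda tf, df: mlt_style_score(tf, df, NUM_DOCS))
print("tf/df order:    ", by_ratio)
print("MLT-style order:", by_mlt)
```

With these assumed numbers the two orderings disagree: the ratio puts aquaspirillum first, while the log-idf score puts "and" first, which would match the behaviour described in the question.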