MoreLikeThis: How to get quality terms from html from content stream?

Jay Hill Fri, 07 Aug 2009 17:24:16 -0700

I'm using the MoreLikeThisHandler with a content stream to get documents
from my index that match the content of an HTML page, like this:
http://localhost:8080/solr/mlt?stream.url=http://www.sfgate.com/cgi-bin/article.cgi?f=/c/a/2009/08/06/SP5R194Q13.DTL&mlt.fl=body&rows=4&debugQuery=true


But, not surprisingly, the query generated is meaningless because a lot of
the markup is picked out as terms:
<str name="parsedquery_toString">
body:li body:href  body:div body:class body:a body:script body:type body:js
body:ul body:text body:javascript body:style body:css body:h body:img
body:var body:articl body:ad body:http body:span body:prop
</str>

Does anyone know a way to transform the HTML so that the content can be
parsed out of the content stream and processed without the markup? Or do I
need to write my own HTMLParsingMoreLikeThisHandler?

If I parse the content out to a plain text file and point the stream.url
param to file:///parsedfile.txt it works great.
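
For reference, the manual pre-parsing step could look roughly like the
sketch below. It is only an illustration of the workaround, not anything
built into Solr: it uses Python's standard html.parser module, and the
names TextExtractor and html_to_text are made up for this example. It
keeps the text nodes and drops the tags, attributes, and the contents of
script and style elements.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # text fragments seen so far

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the fragments and collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the text of an HTML string with the markup removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

The returned string can then be written to a plain text file, and that
file passed through stream.url as a file:/// URL as described above.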

-Jay
