I'm using the MoreLikeThisHandler with a content stream to find documents in my index that match the content of an HTML page, like this:

http://localhost:8080/solr/mlt?stream.url=http://www.sfgate.com/cgi-bin/article.cgi?f=/c/a/2009/08/06/SP5R194Q13.DTL&mlt.fl=body&rows=4&debugQuery=true
But, not surprisingly, the generated query is meaningless because much of the markup is picked out as terms:

<str name="parsedquery_toString"> body:li body:href body:div body:class body:a body:script body:type body:js body:ul body:text body:javascript body:style body:css body:h body:img body:var body:articl body:ad body:http body:span body:prop </str>

Does anyone know a way to transform the HTML so that the text content can be extracted from the content stream and processed without the markup? Or do I need to write my own HTMLParsingMoreLikeThisHandler?

If I parse the content out to a plain text file and point the stream.url param at file:///parsedfile.txt, it works great.

-Jay
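
For reference, the plain-text workaround could probably be done client-side without a temporary file. The following is only a rough sketch, not a tested Solr recipe: it strips tags (and the contents of script and style elements) with Python's standard html.parser, then builds a request that passes the text in the stream.body content-stream parameter instead of stream.url. The names TextExtractor, html_to_text and build_mlt_url are hypothetical helpers made up for this illustration, and whether stream.body is accepted depends on how the Solr instance is configured.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring anything inside <script> or <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space so adjacent block elements do not fuse,
        # then collapse all runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Hypothetical helper: return the visible text of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def build_mlt_url(base, text, field="body", rows=4):
    """Hypothetical helper: build an MLT request URL carrying the text
    in stream.body (assumes the handler accepts that parameter)."""
    params = {"stream.body": text, "mlt.fl": field, "rows": rows}
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    page = (
        '<html><head><style>p{color:red}</style>'
        '<script type="text/javascript">var x = 1;</script></head>'
        '<body><div class="a"><p>Giants win <a href="x">again</a></p></div>'
        '</body></html>'
    )
    print(build_mlt_url("http://localhost:8080/solr/mlt", html_to_text(page)))
```

For a full article the text will likely be too long for a GET URL, so sending the same parameters in a POST body is probably the safer choice.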