On 3-Oct-07, at 3:26 AM, Ravish Bhagdev wrote:
Because of this I cannot present the resulting HTML in a webpage. Is it possible to strip out all HTML tags completely in the result set? Would you recommend sending stripped-out text to Solr instead? But doesn't Solr use HTML features while searching (anchors, titles, etc.)? Why is there no documentation about indexing HTML specifically using Solr? How does Nutch do it? Does it strip out HTML in the snippets it returns?
Solr isn't a web search engine, and it doesn't do any special processing of HTML (although you can ask it to strip HTML if you want).
I recommend stripping the HTML yourself and putting titles, anchors, etc. in separate fields.
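
As a rough illustration of stripping the HTML on the client side before indexing, here is a minimal sketch using only Python's standard html.parser module. The class name, the function name and the field names (title, anchors, text) are made up for this example; they would have to match whatever fields your own Solr schema defines, and real-world pages may need a more forgiving parser.

```python
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Collects title text, anchor text, and visible body text separately.

    Hypothetical helper for illustration; not part of Solr or Nutch.
    """

    SKIP = {"script", "style"}  # content that should never be indexed

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title, self.anchors, self.text = [], [], []
        self._stack = []  # currently open tags we care about

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "a") or tag in self.SKIP:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self._stack:
            # Pop up to and including the most recent matching open tag.
            while self._stack and self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        chunk = " ".join(data.split())  # collapse runs of whitespace
        if not chunk or any(t in self.SKIP for t in self._stack):
            return
        if "title" in self._stack:
            self.title.append(chunk)
            return
        if "a" in self._stack:
            self.anchors.append(chunk)  # anchor text also stays in the body
        self.text.append(chunk)


def html_to_fields(html):
    """Return a dict of plain-text fields (assumed names) for one document."""
    parser = FieldExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": " ".join(parser.title),
        "anchors": parser.anchors,
        "text": " ".join(parser.text),
    }
```

The resulting dict contains no markup, so the text field can be shown on a web page as-is, while title and anchors can be indexed (and boosted) as separate fields.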
I believe that it would be possible to write this as a Solr update-handler plugin, if you wanted it to all run in one place.
-Mike