Re: Indexing HTML

Mike Klaas Thu, 04 Oct 2007 13:51:57 -0700

On 3-Oct-07, at 3:26 AM, Ravish Bhagdev wrote:


Because of this I cannot present the resulting HTML in a webpage.  Is
it possible to strip out all HTML tags completely in the result set?
Would you recommend sending stripped-out text to Solr instead?  But
doesn't Solr use HTML features while searching (anchors/titles etc.)?

Why is there no documentation about indexing HTML specifically using
Solr?  How does Nutch do it?  Does it strip out HTML in the snippets
it returns?

Solr isn't a web search engine, and doesn't do any special processing of HTML (although you can ask it to strip HTML if you want).

I recommend stripping the HTML yourself, and putting titles, anchors, etc. in separate fields.

I believe that it would be possible to write this as a Solr update-handler plugin, if you wanted it to all run in one place.
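
For illustration only, here is a rough sketch of the client-side approach (strip the markup yourself and split the pieces into separate fields) using nothing but Python's standard-library html.parser. The field names "title", "anchors" and "text", and the helper html_to_fields, are hypothetical; they would need to match whatever your own schema defines, and real-world pages may call for a more forgiving parser.

```python
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Collects title, anchor text, and visible text in separate buckets."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        # Hypothetical field names; align them with your own Solr schema.
        self.fields = {"title": [], "anchors": [], "text": []}
        # Nesting counters for the tags we care about.
        self._open = {"title": 0, "a": 0, "script": 0, "style": 0}

    def handle_starttag(self, tag, attrs):
        if tag in self._open:
            self._open[tag] += 1

    def handle_endtag(self, tag):
        if self._open.get(tag, 0) > 0:
            self._open[tag] -= 1

    def handle_data(self, data):
        chunk = " ".join(data.split())  # collapse runs of whitespace
        # Drop empty chunks and anything inside <script> or <style>.
        if not chunk or self._open["script"] or self._open["style"]:
            return
        if self._open["title"]:
            self.fields["title"].append(chunk)
        else:
            if self._open["a"]:
                self.fields["anchors"].append(chunk)
            # Anchor text is also part of the visible body text.
            self.fields["text"].append(chunk)


def html_to_fields(html):
    """Return a dict of tag-free field values for one HTML page."""
    parser = FieldExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": " ".join(parser.fields["title"]),
        "anchors": parser.fields["anchors"],  # one value per link
        "text": " ".join(parser.fields["text"]),
    }
```

The resulting dict contains no markup, so the text field can be displayed in a result page as-is, while the title and anchors remain searchable on their own.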


-Mike
