Re: Indexing HTML Content

David Arpad Geller Thu, 22 May 2008 03:49:49 -0700

Actually, it's very easy: http://us2.php.net/strip_tags

I also store the data in a separate field with the HTML intact for display. In that case, I use urlencode on the string.
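
For anyone not on PHP, here is a rough sketch of the same two-field idea in Python, using only the standard library. It is not the PHP code described above: `TextExtractor`, `strip_tags`, `build_fields` and the field names `content` and `content_html` are made up for illustration, and `quote_plus` is used as an approximate stand-in for PHP's urlencode. Unlike PHP's strip_tags, this version also drops the bodies of script and style elements and puts a space between text fragments, on the assumption that this is preferable for indexing.

```python
from html.parser import HTMLParser
from urllib.parse import quote_plus, unquote_plus


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the bodies of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_tags(html_text):
    """Return the text content of html_text with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    # Join fragments with a space so words in adjacent elements do not merge,
    # then collapse runs of whitespace into single spaces.
    return " ".join(" ".join(parser.parts).split())


def build_fields(html_text):
    """Hypothetical helper: one plain-text field to index, one encoded field to display."""
    return {
        "content": strip_tags(html_text),        # indexed, markup removed
        "content_html": quote_plus(html_text),   # stored only, decode before display
    }


if __name__ == "__main__":
    article = "<h1>Title</h1><p>Some <em>text</em> here.</p>"
    fields = build_fields(article)
    print(fields["content"])
    print(unquote_plus(fields["content_html"]))
```

The encoded field has to be decoded again (here with `unquote_plus`) before the HTML is shown to the user.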


David

McBride, John wrote:

Hello,

In my application I wish to index articles which are stored in HTML
format.

When these are indexed, the HTML markup gets stored along with the content
of the article, which is undesirable.

Do you know of any common way of parsing the text content from HTML
before adding to SOLR?  I understand SOLR 1.3 has an HTML analyser, but
I am using SOLR 1.2 and won't use 1.3 until it's stable, so looking for
a solution to work on a batch of files before being added to SOLR.

Thanks,
John


--
They must find it difficult, those who have taken authority as truth, rather 
than truth as authority. - Gerald Massey
