Actually, it's very easy: run the article through PHP's strip_tags() before sending it to SOLR. See http://us2.php.net/strip_tags
I also store the data in a separate field with the HTML intact for display. In that case, I use urlencode() on the string.
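
If the batch job isn't written in PHP, the same two steps can be sketched in Python using only the standard library. This is a rough illustration rather than what I actually run: the strip_tags() helper below is a stand-in for the PHP function, not a port of it, and the field names (id, text, html_encoded) are made up for the example, so adjust them to whatever your schema defines.

```python
from html.parser import HTMLParser
from urllib.parse import quote_plus


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space and collapse runs of whitespace.
        # Note: a word split by inline tags ("wor<b>ld</b>") becomes two tokens.
        return " ".join(" ".join(self._chunks).split())


def strip_tags(html):
    """Return the plain text of an HTML fragment (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


def build_document(article_id, html):
    """Build a dict of fields to index; the field names are assumptions."""
    return {
        "id": article_id,
        "text": strip_tags(html),          # searchable plain text
        "html_encoded": quote_plus(html),  # original markup, URL-encoded
    }
```

At display time, urllib.parse.unquote_plus() on the html_encoded value gives back the original markup.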
David McBride, John wrote:
Hello,

In my application I wish to index articles which are stored in HTML format. When I index them, the HTML markup gets stored along with the content of the article, which is undesirable. Do you know of any common way of extracting the text content from HTML before adding it to SOLR?

I understand SOLR 1.3 has an HTML analyser, but I am using SOLR 1.2 and won't use 1.3 until it's stable, so I am looking for a solution that works on a batch of files before they are added to SOLR.

Thanks,
John
-- They must find it difficult, those who have taken authority as truth, rather than truth as authority. - Gerald Massey