Hi, I am working on a contract to index some WordPress data. The post content column of course contains HTML, which I'd like to strip out. Here is my data importer config:

<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/econetsm"
              user="*******"
              password="*******" />
  <document>
    <entity name="post"
            transformer="HTMLStripTransformer"
            query="SELECT id, post_content, post_title FROM elinstmkting_posts e"
            onError="abort"
            deltaQuery="SELECT * FROM elinstmkting_posts e where post_modified_gmt > '${dataimporter.last_index_time}'">
      <field column="POST_TITLE" name="post_title" stripHTML="false" />
      <field column="POST_CONTENT" name="post_content" stripHTML="true" />
    </entity>
  </document>
</dataConfig>

This looks correct according to the wiki docs, but a search for "strong" (the <strong> tag) still finds matches, and the HTML is returned in the field. I assume I am doing something stupid. I am using the latest stable Solr (1.4.0).

Does it matter that the post data is not a complete HTML document (it has no <html> start tag and no <body> tag)?

James