Hi,

I am working on a contract to index some WordPress data. The post content
column of course contains HTML, and I'd like to strip it out before indexing.
Here is my data import config:

<dataConfig>
    <dataSource driver="com.mysql.jdbc.Driver"
                url="jdbc:mysql://localhost:3306/econetsm"
                user="*******" password="*******"/>
    <document>
        <entity name="post" transformer="HTMLStripTransformer"
                query="SELECT id, post_content, post_title FROM elinstmkting_posts e"
                onError="abort"
                deltaQuery="SELECT * FROM elinstmkting_posts e where post_modified_gmt > '${dataimporter.last_index_time}'">
            <field column="POST_TITLE" name="post_title" stripHTML="false"/>
            <field column="POST_CONTENT" name="post_content" stripHTML="true"/>
        </entity>
    </document>
</dataConfig>

This looks correct according to the wiki docs, but the HTML is still being
indexed: a search for "strong" (from the <strong> tag) matches, and the field
is returned with the HTML intact.
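
One variation that may be worth testing is the case of the column names. The
sketch below assumes (unverified here) that the transformer looks up each
column by the exact name the JDBC driver returns, so that "POST_CONTENT" would
not match the lower-case "post_content" selected in the query. Only the column
attributes differ from the config above; everything else is unchanged.

```xml
<!-- Sketch only: assumes transformer column lookup is case-sensitive. -->
<entity name="post" transformer="HTMLStripTransformer"
        query="SELECT id, post_content, post_title FROM elinstmkting_posts e"
        onError="abort"
        deltaQuery="SELECT * FROM elinstmkting_posts e where post_modified_gmt > '${dataimporter.last_index_time}'">
    <!-- Column names written exactly as they appear in the SELECT list. -->
    <field column="post_title" name="post_title" stripHTML="false"/>
    <field column="post_content" name="post_content" stripHTML="true"/>
</entity>
```

If that assumption holds, a full re-import followed by the same search for
"strong" should show whether the markup is still reaching the index.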

I assume I am doing something wrong. I am using the latest stable
Solr (1.4.0).

Does it matter that the post data is not a complete HTML document (it
has no <html> start tag or <body> tag)?

James
