If you want to sanitize those fields during indexing, regular
expressions can do it: write a pattern that matches the bogus
characters and replace them. The DIH has a regular expression
transformer (RegexTransformer), and the Lucene text analysis stack
has a regular expression CharFilter (PatternReplaceCharFilter). Note
that a CharFilter only changes what gets indexed, not the stored
value.
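
To give an idea of what such patterns might look like, here is a
rough sketch in Python, only to try the regular expressions out
against the two examples from the original question. The names
(ORPHAN_AMP, CURLY_RE, sanitize) and the choice of replacements
are my own assumptions, not anything built into Solr. The
ampersand pattern uses only a negative lookahead and character
classes, so it should carry over to Java regex syntax, but test it
against your own data first.

```python
import re

# Hypothetical pattern: an ampersand that does NOT begin a named,
# decimal, or hexadecimal entity such as &amp; &#38; or &#x26;
ORPHAN_AMP = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)

# Curly quotes mapped to their plain ASCII equivalents.
CURLY_QUOTES = {
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
}
CURLY_RE = re.compile("[\u2018\u2019\u201c\u201d]")


def sanitize(text: str) -> str:
    """Escape orphaned ampersands and straighten curly quotes."""
    text = ORPHAN_AMP.sub("&amp;", text)
    return CURLY_RE.sub(lambda m: CURLY_QUOTES[m.group(0)], text)


if __name__ == "__main__":
    print(sanitize("Peanut Butter & Jelly"))
    print(sanitize("we\u2019re"))
```

Whether you escape the ampersand or strip it, and whether you
touch the curly quotes at all, depends on what you want to see in
the index.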

On Wed, Aug 15, 2012 at 2:10 PM, Michael Della Bitta
<michael.della.bi...@appinions.com> wrote:
> Hi, Jon,
>
> As far as I know, DataImportHandler doesn't transfer data to the rest
> of Solr via XML so it shouldn't be a problem...
>
> Michael Della Bitta
>
> ------------------------------------------------
> Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
> www.appinions.com
> Where Influence Isn’t a Game
>
>
> On Wed, Aug 15, 2012 at 5:03 PM, Jon Drukman <jdruk...@gmail.com> wrote:
>> I am pulling some fields from a mysql database using DataImportHandler and
>> some of them have invalid XML in them.  Does DataImportHandler do any kind
>> of filtering/sanitizing to ensure that it will go in OK or is it all on me?
>>
>> Example bad data:  orphaned ampersands ("Peanut Butter & Jelly"), curly
>> quotes ("we’re")
>>
>> -jsd-



-- 
Lance Norskog
goks...@gmail.com
