If you want to sanitize them during indexing, the regular expression tools can do this. You would write a regular expression that matches the bogus elements and replaces them. The DIH has a regular expression transformer (RegexTransformer), and the Lucene text analysis stack has a regular expression CharFilter (PatternReplaceCharFilter).
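As a rough sketch of the kind of pattern I mean, here it is in plain Python rather than in DIH or analyzer config. The function name, the definition of an "orphaned" ampersand, and the choice to fold curly quotes to ASCII are all my assumptions, not anything Solr does for you. The pattern uses only syntax that java.util.regex also accepts, so it should carry over, but I have not tested it there.

```python
import re

# Assumption: an "orphaned" ampersand is any & that does not begin a
# named (&amp;), decimal (&#38;) or hexadecimal (&#x26;) entity reference.
ORPHAN_AMP = re.compile(
    r"&(?![A-Za-z][A-Za-z0-9]*;|#[0-9]+;|#[xX][0-9A-Fa-f]+;)"
)

# Assumption: curly quotes should be folded to their ASCII equivalents.
CURLY_QUOTES = str.maketrans({
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote / apostrophe
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
})


def sanitize(text: str) -> str:
    """Escape orphaned ampersands, then fold curly quotes to ASCII."""
    text = ORPHAN_AMP.sub("&amp;", text)
    return text.translate(CURLY_QUOTES)


if __name__ == "__main__":
    print(sanitize("Peanut Butter & Jelly"))
    print(sanitize("we\u2019re"))
```

The negative lookahead leaves existing entity references alone, so running it twice does not double-escape anything.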
On Wed, Aug 15, 2012 at 2:10 PM, Michael Della Bitta <michael.della.bi...@appinions.com> wrote:
> Hi, Jon,
>
> As far as I know, DataImportHandler doesn't transfer data to the rest
> of Solr via XML so it shouldn't be a problem...
>
> Michael Della Bitta
>
> ------------------------------------------------
> Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
> www.appinions.com
> Where Influence Isn’t a Game
>
>
> On Wed, Aug 15, 2012 at 5:03 PM, Jon Drukman <jdruk...@gmail.com> wrote:
>> I am pulling some fields from a mysql database using DataImportHandler and
>> some of them have invalid XML in them. Does DataImportHandler do any kind
>> of filtering/sanitizing to ensure that it will go in OK or is it all on me?
>>
>> Example bad data: orphaned ampersands ("Peanut Butter & Jelly"), curly
>> quotes ("we’re")
>>
>> -jsd-

--
Lance Norskog
goks...@gmail.com