Hi,

You can create a custom update request processor [1] to strip unwanted input as it is about to enter the index.
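
As a rough illustration of the cleaning step such a processor could apply to a field value before it is stored, here is a minimal sketch in plain Java. The class name HtmlFieldCleaner, the method clean() and the small entity table are hypothetical, made up for this example; they are not part of Solr. The regex-based tag stripping is a simplification and not the same logic as solr.HTMLStripCharFilterFactory. The Solr wiring is left out: as far as I know you would subclass UpdateRequestProcessor, override processAdd(AddUpdateCommand), and call something like this on the relevant fields of the incoming document, but please check the wiki page [1] for the details of your Solr version.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a Solr class): strips tags and decodes entities
// from a raw field value so the stored value is plain text.
final class HtmlFieldCleaner {

    // Naive tag matcher; an assumption that the input has no '>' inside attributes.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Matches numeric (&#233; / &#xE9;) and named (&eacute;) entities.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);");

    // Deliberately tiny table for illustration; a real one would cover all named entities.
    private static final Map<String, String> NAMED = Map.of(
            "amp", "&", "lt", "<", "gt", ">", "quot", "\"",
            "apos", "'", "nbsp", " ", "eacute", "\u00e9");

    static String clean(String raw) {
        if (raw == null) {
            return null;
        }
        // Remove tags first, so that escaped markup such as &lt;b&gt; survives as text.
        String noTags = TAG.matcher(raw).replaceAll(" ");

        // Decode entities one by one; unknown ones are left as they were.
        Matcher m = ENTITY.matcher(noTags);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String decoded = decode(m.group(1), m.group());
            m.appendReplacement(out, Matcher.quoteReplacement(decoded));
        }
        m.appendTail(out);

        // Collapse the whitespace left behind by the removed tags.
        return out.toString().replaceAll("\\s+", " ").trim();
    }

    private static String decode(String name, String original) {
        try {
            if (name.startsWith("#x") || name.startsWith("#X")) {
                return new String(Character.toChars(Integer.parseInt(name.substring(2), 16)));
            }
            if (name.startsWith("#")) {
                return new String(Character.toChars(Integer.parseInt(name.substring(1))));
            }
        } catch (IllegalArgumentException e) {
            // Out-of-range or unparsable code point: keep the entity untouched.
            return original;
        }
        return NAMED.getOrDefault(name, original);
    }
}
```

Doing this in an update processor means the indexed and the stored value both come from the cleaned text, so highlights would no longer contain escaped markup.
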

[1]: http://wiki.apache.org/solr/UpdateRequestProcessor

Cheers,

On Monday 06 December 2010 17:36:09 Emmanuel Bégué wrote:
> Hello,
>
> Is it possible to manipulate the value of a field before it is stored?
>
> I'm indexing a database where some field contain raw HTML, including
> named character entities.
>
> Using solr.HTMLStripCharFilterFactory on the index analyzer, results
> in this HTML being correctly stripped, and named character entities
> replaced by the corresponding characters, in the index (as verified
> when searching, and with Luke).
>
> But, the stored values of the documents are stored unmodified, so the
> result sets, including highlights, contain HTML tags (that are
> escaped) and "entities" (where the leading '&' is also escaped) which
> make handling the results quite difficult.
>
> So, is it possible to apply some filters to the data before it is
> stored in the non-indexed fields?
>
> I couldn't find a part of the documentation that said whether it was
> possible or not; I did find this message in the archives of this list:
>
> > From: Noble Paul
> > Sent: Tuesday, March 31, 2009 5:41 PM
> > Subject: Re: indexed fields vs stored fields
> >
> > indexed = can be searched (mean you can use this to query). This
> > undergoes tokenization filter etc
> >
> > stored = can be retrieved. No modification to the data. This is
> > stored verbatim
>
> which seems to say that it is not possible; but maybe things have
> changed since then?
>
> Any other idea? given that:
> - I have zero control over what is stored in the database
> - using the Solr XML update protocol i could probably transform the
>   data before sending it
> - ... but I'd much rather continue using DataImportHandler to access
>   the database
>
> Thanks,
> Regards,
> EB

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350