Hello,

Is it possible to manipulate the value of a field before it is stored?

I'm indexing a database where some fields contain raw HTML, including
named character entities.

Using solr.HTMLStripCharFilterFactory on the index analyzer results in
this HTML being correctly stripped, and named character entities being
replaced by the corresponding characters, in the index (as verified
when searching, and with Luke).

However, the stored values of the documents are kept unmodified, so the
result sets, including highlights, contain HTML tags (which are
escaped) and "entities" (where the leading '&' is also escaped), which
makes handling the results quite difficult.

So, is it possible to apply some filters to the data before it is
stored in the non-indexed fields?

I couldn't find a part of the documentation that said whether it was
possible or not; I did find this message in the archives of this list:

    > From: Noble Paul
    > Sent: Tuesday, March 31, 2009 5:41 PM
    > Subject: Re: indexed fields vs stored fields
    >
    > indexed = can be searched (mean you can use this to query). This
    > undergoes tokenization filter etc
    > stored = can be retrieved. No modification to the data. This is
    > stored verbatim

which seems to say that it is not possible; but maybe things have
changed since then?

Any other ideas, given that:
- I have zero control over what is stored in the database
- using the Solr XML update protocol I could probably transform the
data before sending it
- ... but I'd much rather continue using DataImportHandler to access
the database
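
To make the second option concrete, here is a minimal sketch of what
"transform the data before sending it" could look like, using only the
Python standard library. It strips tags, decodes named character
entities, and builds an <add> message for the XML update protocol. The
field names "id" and "body" and the helper names are hypothetical, and
posting the result to Solr is left out.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class _TextExtractor(HTMLParser):
    """Keeps only text nodes; tags are dropped, entities are decoded."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode &eacute;, &amp;, etc.
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def strip_html(raw):
    """Return the plain text of an HTML fragment (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


def build_add_xml(doc_id, raw_html):
    """Build an <add><doc> message; 'id' and 'body' are assumed field names."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    ET.SubElement(doc, "field", name="id").text = doc_id
    # ElementTree re-escapes '<' and '&' so the message stays valid XML.
    ET.SubElement(doc, "field", name="body").text = strip_html(raw_html)
    return ET.tostring(add, encoding="unicode")
```

With this, the stored value would already be plain text, at the cost of
replacing DataImportHandler with a separate import script, which is
exactly what I would like to avoid.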

Thanks,
Regards,
EB
