The tokens that Lucene sees (pre-4.0) are char[] based (i.e., UTF-16),
so the first place where invalid UTF-8 is detected/corrected/etc. is
during your analysis process, which takes your raw content and
produces char[] based tokens.
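
On the application side you can catch bad bytes before they ever reach
analysis by decoding with a strict CharsetDecoder instead of relying on
the default replacement behavior.  Rough sketch, using only the standard
java.nio.charset API (nothing Lucene-specific; the class and method
names below are just made up for the example):

  import java.io.InputStream;
  import java.io.InputStreamReader;
  import java.io.Reader;
  import java.nio.charset.Charset;
  import java.nio.charset.CodingErrorAction;

  public class StrictUtf8 {
    // Returns a Reader that throws MalformedInputException on invalid
    // UTF-8 bytes instead of quietly substituting U+FFFD, so a bad
    // document can be rejected before it reaches the analyzer.
    public static Reader reader(InputStream in) {
      return new InputStreamReader(
          in,
          Charset.forName("UTF-8").newDecoder()
              .onMalformedInput(CodingErrorAction.REPORT)
              .onUnmappableCharacter(CodingErrorAction.REPORT));
    }
  }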

Second, during indexing, Lucene ensures that the incoming char[]
tokens are valid UTF-16.
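
Valid UTF-16 here just means every high surrogate is immediately
followed by a low surrogate, and no low surrogate appears on its own.
If you want to run the same check yourself on a token's char[], it's
only a few lines; this is a sketch, not Lucene's actual code:

  // Returns true if the chars in buf[off..off+len) are valid UTF-16.
  public static boolean isValidUTF16(char[] buf, int off, int len) {
    for (int i = off; i < off + len; i++) {
      char c = buf[i];
      if (Character.isHighSurrogate(c)) {
        if (i + 1 >= off + len || !Character.isLowSurrogate(buf[i + 1])) {
          return false;        // naked high surrogate
        }
        i++;                   // skip over the valid pair
      } else if (Character.isLowSurrogate(c)) {
        return false;          // low surrogate with no preceding high
      }
    }
    return true;
  }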

If an invalid char sequence is hit, e.g. a naked (unpaired) surrogate
or an invalid surrogate pair, the behavior is undefined; but, today,
Lucene will replace such invalid chars with the Unicode replacement
character U+FFFD, so you could iterate all terms looking for that
replacement char.
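
Something like this (untested sketch against the pre-4.0 TermEnum API)
would print every term containing the replacement char:

  import java.io.File;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.index.TermEnum;
  import org.apache.lucene.store.FSDirectory;

  public class FindReplacementChars {
    public static void main(String[] args) throws Exception {
      // args[0] = path to the index directory
      IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])));
      try {
        TermEnum terms = reader.terms();
        while (terms.next()) {
          Term t = terms.term();
          if (t.text().indexOf('\uFFFD') != -1) {
            System.out.println(t.field() + ": " + t.text());
          }
        }
        terms.close();
      } finally {
        reader.close();
      }
    }
  }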

Mike

On Wed, Jan 12, 2011 at 5:16 PM, Paul <p...@nines.org> wrote:
> We've created an index from a number of different documents that are
> supplied by third parties. We want the index to only contain UTF-8
> encoded characters. I have a couple questions about this:
>
> 1) Is there any way to be sure during indexing (by setting something
> in the Solr configuration?) that the documents that we index will
> always be stored in UTF-8? Can Solr convert documents that need
> converting on the fly, or can Solr reject documents containing illegal
> characters?
>
> 2) Is there a way to scan the existing index to find any string
> containing non-UTF-8 characters? Or is there another way that I can
> discover if any crept into my index?
>
