Scanning for only 'valid' UTF-8 is definitely not simple. Byte ranges let you rule out some input that obviously is not valid UTF-8, but byte ranges alone cannot confirm that input is valid UTF-8, because some bytes are only valid when they come directly before or after certain other bytes.
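To illustrate the point about byte order, here is a minimal sketch that uses the JDK's own strict UTF-8 decoder instead of a byte-range test. The class and method names (`Utf8Check`, `isValidUtf8`) are made up for this example. Note that it only tells you whether the bytes are well-formed UTF-8; it says nothing about whether UTF-8 is what the author of the text intended.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Solr or any library.
class Utf8Check {

    // Returns true only if the whole array decodes as well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently
        // substituting U+FFFD for malformed input.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] eAcute = {(byte) 0xC3, (byte) 0xA9};   // "é" in UTF-8
        byte[] loneContinuation = {(byte) 0xA9};      // continuation byte with no lead byte
        byte[] truncated = {(byte) 0xC3};             // lead byte with no continuation
        byte[] latin1 = {'c', 'a', 'f', (byte) 0xE9}; // "café" in ISO-8859-1

        System.out.println(isValidUtf8(eAcute));           // true
        System.out.println(isValidUtf8(loneContinuation)); // false
        System.out.println(isValidUtf8(truncated));        // false
        System.out.println(isValidUtf8(latin1));           // false
    }
}
```

The bytes 0xC3 and 0xA9 both fall in ranges that are legal in UTF-8, yet each is invalid on its own; only the pair in the right order is accepted.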

There is no good way to do what you're doing. Once you've lost track of what encoding something is in, you are reduced to applying heuristics to the text to guess what encoding it was meant to be in.

There is no cheap way to do this to an entire Solr index. You're just going to have to fetch every single stored field (stored fields only; indexed fields are pretty much lost to you) and apply heuristic algorithms to it. Keep in mind that Solr probably shouldn't ever be used as your canonical _store_ of data; Solr isn't a 'store', it's an index. So you really ought to have this stuff stored somewhere else if you want to be able to examine or modify it like this, and deal with it there. This isn't really a Solr question at all, even if you are querying Solr stored fields to try to guess their char encodings.
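As a rough idea of what such a heuristic can look like once you have a stored value in hand, here is a sketch that flags two common symptoms in an already-decoded Java String: the U+FFFD replacement character, and text that looks like UTF-8 bytes that were mistakenly read as ISO-8859-1. Everything here is an assumption for illustration: the class and method names are invented, fetching the values from Solr is left out, and the check only covers the ISO-8859-1 case (windows-1252 and other encodings would need their own handling). It can produce both false positives and false negatives.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; values are assumed to be fetched elsewhere.
class StoredValueCheck {

    // U+FFFD usually means some decoder already gave up on malformed bytes.
    static boolean hasReplacementChar(String value) {
        return value.indexOf('\uFFFD') >= 0;
    }

    // Heuristic: true if the string has non-ASCII characters, all of them
    // fit in one byte, and those bytes form well-formed UTF-8. That pattern
    // suggests UTF-8 input that was decoded as ISO-8859-1.
    static boolean looksLikeUtf8ReadAsLatin1(String value) {
        boolean sawNonAscii = false;
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c > 0xFF) {
                return false; // cannot have come from an ISO-8859-1 decode
            }
            if (c > 0x7F) {
                sawNonAscii = true;
            }
        }
        if (!sawNonAscii) {
            return false; // pure ASCII reads the same either way
        }
        // Undo the suspected ISO-8859-1 decode to recover the raw bytes.
        byte[] raw = value.getBytes(StandardCharsets.ISO_8859_1);
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(looksLikeUtf8ReadAsLatin1("caf\u00C3\u00A9")); // "cafÃ©"
        System.out.println(looksLikeUtf8ReadAsLatin1("caf\u00E9"));       // "café"
        System.out.println(hasReplacementChar("caf\uFFFD"));
    }
}
```

A flagged value is only a candidate for review; deciding what the text was meant to be still needs the original source or a proper detection library.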

There are various packages of such heuristic algorithms for guessing char encoding; I wouldn't try to write my own. icu4j might include such an algorithm, but I'm not sure.

On 1/13/2011 1:12 PM, Peter Karich wrote:
  take a look also into icu4j which is one of the contrib projects ...

Converting on the fly is not supported by Solr but should be relatively
easy in Java.
Also scanning is relatively simple (accept only a range). Detection too:
http://www.mozilla.org/projects/intl/chardet.html

We've created an index from a number of different documents that are
supplied by third parties. We want the index to only contain UTF-8
encoded characters. I have a couple questions about this:

1) Is there any way to be sure during indexing (by setting something
in the solr configuration?) that the documents that we index will
always be stored in utf-8? Can solr convert documents that need
converting on the fly, or can solr reject documents containing illegal
characters?

2) Is there a way to scan the existing index to find any string
containing non-utf8 characters? Or is there another way that I can
discover if any crept into my index?

