So you're allowed to put the entire original document in a stored field in Solr, but you aren't allowed to stick it in, say, a redis or couchdb too? Ah, bureaucracy. But there's no reason what you are doing won't work, as you of course already know from doing it.
If you actually know the charset of a document when indexing it, you might want to consider putting THAT in a stored field; it's easier to keep track of the encoding you know than to try to guess it again later.
________________________________________
From: Paul [p...@nines.org]
Sent: Thursday, January 13, 2011 6:21 PM
To: solr-user@lucene.apache.org
Subject: Re: verifying that an index contains ONLY utf-8

Thanks for all the responses.

CharsetDetector does look promising. Unfortunately, we aren't allowed to keep the original of much of our data, so the Solr index is the only place it exists (to us).

I do have a Java app that "reindexes", i.e., reads all documents out of one index, does some transform on them, then writes them to a second index. So I already have a place where I see all the data in the index stream by. I wanted to make sure there wasn't some built-in way of doing what I need.

I know that it is possible to fool the algorithm, but I'll first check whether the string is a possible UTF-8 string and, if so, leave it unchanged. Then I won't be introducing more errors, and maybe I can detect a large percentage of the non-UTF-8 strings.

On Thu, Jan 13, 2011 at 4:36 PM, Robert Muir <rcm...@gmail.com> wrote:
> it does:
> http://icu-project.org/apiref/icu4j/com/ibm/icu/text/CharsetDetector.html
> this takes a sample of the file and makes a guess.
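
The "is this a possible UTF-8 string" check Paul describes could look roughly like the sketch below. It uses only the JDK's strict CharsetDecoder, and it assumes the reindexing app can get at the raw bytes of a value; the class name Utf8Check and the method isValidUtf8 are made up for illustration and are not part of Solr or ICU4J. Anything that fails the check would be a candidate for CharsetDetector, which is not shown here.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Solr or ICU4J.
public class Utf8Check {

    /** Returns true if the bytes form a well-formed UTF-8 sequence. */
    public static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw on bad input instead of
        // silently substituting U+FFFD, which is the default for
        // convenience methods such as new String(bytes, UTF_8).
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Note that pure ASCII always passes, and some byte sequences from other encodings happen to be valid UTF-8 too, so a pass only means "possibly UTF-8", which is the same caveat Paul raises about fooling the algorithm.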