Thanks for all the responses. CharsetDetector does look promising. Unfortunately, we aren't allowed to keep the original of much of our data, so the Solr index is the only place it exists (for us). I do have a Java app that "reindexes": it reads all the documents out of one index, applies some transform to them, and writes them to a second index. So I already have a place where all the data in the index streams by. I just wanted to make sure there wasn't some built-in way of doing what I need.
I know that it's possible to fool the detection algorithm, but I'll first check whether the string is valid UTF-8 and leave it untouched if it is. That way I won't introduce new errors, and I can hopefully still catch a large percentage of the non-UTF-8 strings.

On Thu, Jan 13, 2011 at 4:36 PM, Robert Muir <rcm...@gmail.com> wrote:
> it does:
> http://icu-project.org/apiref/icu4j/com/ibm/icu/text/CharsetDetector.html
> this takes a sample of the file and makes a guess.
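For what it's worth, the "is it possibly UTF-8?" pre-check I described can be done with the standard `java.nio.charset` API, no ICU needed: a strict `CharsetDecoder` that reports (rather than replaces) malformed input will reject any byte sequence that isn't well-formed UTF-8. A minimal sketch (class and method names are just illustrative):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {

    // Returns true only if the bytes decode cleanly as UTF-8.
    // REPORT makes the decoder throw on malformed or unmappable
    // sequences instead of silently substituting U+FFFD.
    public static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // 0xC3 0xA9 is "é" in UTF-8: valid.
        System.out.println(isValidUtf8(new byte[]{(byte) 0xC3, (byte) 0xA9}));
        // 0xC3 followed by 0x28 is a broken multi-byte sequence: invalid.
        System.out.println(isValidUtf8(new byte[]{(byte) 0xC3, (byte) 0x28}));
    }
}
```

Only the strings that fail this check would then be handed to ICU's CharsetDetector to guess the actual encoding. Note the caveat from earlier in the thread still applies: some legacy-encoded byte sequences happen to also be well-formed UTF-8, so this check reduces but can't eliminate false negatives.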