Thanks for all the responses.

CharsetDetector does look promising. Unfortunately, we aren't allowed
to keep the originals of much of our data, so the Solr index is the
only place it exists (for us). I do have a Java app that "reindexes",
i.e., reads all documents out of one index, applies some transform to
them, then writes them to a second index. So I already have a place
where all the data in the index streams by. I just wanted to make sure
there wasn't some built-in way of doing what I need.

I know that it is possible to fool the algorithm, but I'll first check
whether the string is valid UTF-8 and leave it unchanged if so. That
way I won't be introducing more errors, and maybe I can detect a large
percentage of the non-UTF-8 strings.
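For what it's worth, the "is this valid UTF-8?" pre-check can be done with just the JDK, no ICU needed: decode with a strict CharsetDecoder that reports malformed input instead of replacing it. This is a sketch of that idea (class and method names are mine, not from any library); the CharsetDetector fallback for strings that fail the check would be a separate step. Note it's still a heuristic — a byte sequence that happens to be well-formed UTF-8 could also be legitimate Latin-1.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class CharsetCheck {

    /** Returns true if the bytes decode as well-formed UTF-8. */
    public static boolean isLikelyUtf8(byte[] bytes) {
        // REPORT makes decode() throw on malformed input instead of
        // silently substituting U+FFFD replacement characters.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "café".getBytes(StandardCharsets.UTF_8);
        // Same word encoded as ISO-8859-1: 0xE9 alone is malformed UTF-8.
        byte[] latin1 = {0x63, 0x61, 0x66, (byte) 0xE9};
        System.out.println(isLikelyUtf8(utf8));   // true
        System.out.println(isLikelyUtf8(latin1)); // false
    }
}
```

Strings that fail this check would then go to a detector (e.g. ICU's CharsetDetector) for a best-effort guess; strings that pass are left alone.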

On Thu, Jan 13, 2011 at 4:36 PM, Robert Muir <rcm...@gmail.com> wrote:
> it does: 
> http://icu-project.org/apiref/icu4j/com/ibm/icu/text/CharsetDetector.html
> this takes a sample of the file and makes a guess.
