So you're allowed to put the entire original document in a stored field in
Solr, but you aren't allowed to stick it in, say, Redis or CouchDB too? Ah,
bureaucracy. But there's no reason what you're doing won't work, as you of
course already know from doing it.

If you actually know the charset of a document when indexing it, you might want
to consider putting THAT in a stored field; it's easier to keep track of an
encoding you already know than to try to guess it again later.
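
If you're using SolrJ, that's just one more addField call. A minimal
sketch (untested; the "content" and "charset" field names are made up
here, so you'd need matching fields in your schema):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexWithCharset {
        public static void main(String[] args) throws Exception {
            SolrServer solr =
                new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");
            doc.addField("content", "...the original document text...");
            doc.addField("charset", "UTF-8"); // whatever you knew at index time
            solr.add(doc);
            solr.commit();
        }
    }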

________________________________________
From: Paul [p...@nines.org]
Sent: Thursday, January 13, 2011 6:21 PM
To: solr-user@lucene.apache.org
Subject: Re: verifying that an index contains ONLY utf-8

Thanks for all the responses.

CharsetDetector does look promising. Unfortunately, we aren't allowed
to keep the original of much of our data, so the Solr index is the
only place it exists (to us). I do have a Java app that "reindexes",
i.e., reads all documents out of one index, does some transform on
them, then writes them to a second index. So I already have a place
where I can watch all the data in the index stream by. I wanted to make
sure there wasn't some built-in way of doing what I need.

I know that it is possible to fool the algorithm, but I'll first check
whether a string is already valid UTF-8 and leave it alone if it is.
That way I won't introduce new errors, and maybe I can detect a large
percentage of the non-UTF-8 strings.
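
Roughly what I have in mind for that check, using the standard
CharsetDecoder (an untested sketch; it assumes I can recover the
original bytes -- if a string was mis-decoded as Latin-1 at index
time, getBytes("ISO-8859-1") gets them back, since Latin-1 maps bytes
one-to-one):

    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;

    public class Utf8Check {
        // True only if the bytes decode cleanly as UTF-8.
        static boolean isValidUtf8(byte[] bytes) {
            CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
            try {
                decoder.decode(ByteBuffer.wrap(bytes));
                return true;
            } catch (CharacterCodingException e) {
                return false;
            }
        }
    }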

On Thu, Jan 13, 2011 at 4:36 PM, Robert Muir <rcm...@gmail.com> wrote:
> It does:
> http://icu-project.org/apiref/icu4j/com/ibm/icu/text/CharsetDetector.html
> This takes a sample of the file and makes a guess.
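
For reference, basic usage looks roughly like this (an untested sketch
against the ICU4J API linked above; getConfidence() is a 0-100 score):

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import com.ibm.icu.text.CharsetDetector;
    import com.ibm.icu.text.CharsetMatch;

    public class GuessCharset {
        public static void main(String[] args) throws Exception {
            // setText(InputStream) needs mark/reset, hence the buffering.
            BufferedInputStream in =
                new BufferedInputStream(new FileInputStream(args[0]));
            CharsetDetector detector = new CharsetDetector();
            detector.setText(in);
            CharsetMatch match = detector.detect();
            System.out.println(match.getName()
                + " (confidence " + match.getConfidence() + "/100)");
            in.close();
        }
    }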
