Take a look also at icu4j, which is one of the contrib projects ...
Converting on the fly is not supported by Solr, but it should be relatively
easy in Java.
Scanning is also relatively simple (accept only a range). Detection too:
http://www.mozilla.org/projects/intl/chardet.html
Scanning for only 'valid' UTF-8 is definitely not simple. You can
eliminate some obviously invalid UTF-8 by byte ranges, but you
can't confirm valid UTF-8 by byte ranges alone. Some bytes are valid
UTF-8 only when they come directly before or after certain other bytes.
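The point above can be made concrete with a strict decoder: a sketch, using only the standard library, of rejecting byte sequences that a simple range check would pass (the class and method names here are illustrative, not from the thread).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    // Returns true only if the bytes form a well-formed UTF-8 sequence.
    // A strict decoder rejects truncated multi-byte sequences and stray
    // continuation bytes -- exactly the cases a per-byte range scan
    // cannot catch, because the same byte value may be legal in one
    // position and illegal in another.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] good = "héllo".getBytes(StandardCharsets.UTF_8);
        byte[] truncated = {(byte) 0xC3};   // lead byte with no continuation
        byte[] stray = {(byte) 0xA9};       // continuation byte with no lead
        System.out.println(isValidUtf8(good));       // true
        System.out.println(isValidUtf8(truncated));  // false
        System.out.println(isValidUtf8(stray));      // false
    }
}
```

Note that 0xA9 is rejected on its own but accepted inside "é" (0xC3 0xA9), which is why position matters and ranges alone don't suffice.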
The tokens that Lucene sees (pre-4.0) are char[] based (i.e., UTF-16), so
the first place where invalid UTF-8 is detected/corrected/etc. is
during your analysis process, which takes your raw content and
produces char[] based tokens.
Second, during indexing, Lucene ensures that the incoming char[] tokens are valid UTF-16; unpaired surrogates are replaced with the Unicode replacement character (U+FFFD) when terms are written to the index.
On Thu, Jan 13, 2011 at 2:05 PM, Jonathan Rochkind rochk...@jhu.edu wrote:
There are various packages of such heuristic algorithms to guess char
encoding; I wouldn't try to write my own. icu4j might include such an
algorithm, not sure.
it does: CharsetDetector.
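A sketch of how that class is typically used (this assumes the icu4j library, package com.ibm.icu, is on the classpath; the sample sentence is just illustrative data). Detection is heuristic: you get a best guess plus a confidence score, not a guarantee.

```java
import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;
import java.nio.charset.StandardCharsets;

public class DetectEncoding {
    public static void main(String[] args) {
        // Some accented text encoded as UTF-8; the multi-byte sequences
        // give the detector statistical evidence to latch onto.
        byte[] data = "Dès Noël où un zéphyr haï me vêt"
                .getBytes(StandardCharsets.UTF_8);

        CharsetDetector detector = new CharsetDetector();
        detector.setText(data);
        CharsetMatch match = detector.detect();  // best guess

        // Confidence is 0-100; treat low scores with suspicion rather
        // than trusting the charset name blindly.
        System.out.println(match.getName() + " (" + match.getConfidence() + ")");
    }
}
```

On short or pure-ASCII input the guess is much weaker, since there is little statistical signal to distinguish encodings.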
Thanks for all the responses.
CharsetDetector does look promising. Unfortunately, we aren't allowed
to keep the original of much of our data, so the solr index is the
only place it exists (to us). I do have a java app that reindexes,
i.e., reads all documents out of one index, does some transform
Subject: Re: verifying that an index contains ONLY utf-8
This is supposed to be dealt with outside the index. All input must be UTF-8
encoded. Failing to do so will give unexpected results.
We've created an index from a number of different documents that are
supplied by third parties. We want the index to only contain UTF-8
encoded characters.