Re: verifying that an index contains ONLY utf-8

2011-01-13 Thread Peter Karich
Also take a look at icu4j, which is one of the contrib projects ... converting on the fly is not supported by Solr but should be relatively easy in Java. Scanning is also relatively simple (accept only a range). Detection too: http://www.mozilla.org/projects/intl/chardet.html We've created an

Re: verifying that an index contains ONLY utf-8

2011-01-13 Thread Jonathan Rochkind
Scanning for only 'valid' UTF-8 is definitely not simple. You can rule out some obviously invalid UTF-8 by byte ranges, but you can't confirm valid UTF-8 by byte ranges alone. Some bytes are valid only when they come before or after certain other bytes. There is no
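To illustrate the point about byte ranges: a minimal sketch, assuming plain Java with no Solr or Lucene classes, that asks the JDK's own strict UTF-8 decoder whether a byte array is well-formed instead of checking ranges by hand. The class and method names (`Utf8Validator`, `isValidUtf8`) are made up for this example. Note that it only answers "is this well-formed UTF-8?"; it cannot tell whether the bytes were originally meant to be UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Solr, Lucene, or icu4j.
final class Utf8Validator {

    // Returns true only if the whole array decodes as well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        // 0x80 is a legal continuation byte, but it is invalid on its own:
        // a range check alone would accept it, the decoder does not.
        byte[] loneContinuation = {(byte) 0x80};

        System.out.println(isValidUtf8(utf8));
        System.out.println(isValidUtf8(loneContinuation));
    }
}
```

The lone `0x80` case is the kind of input that a "accept only a range" scan would let through, since the byte is legal in UTF-8 but only after a suitable lead byte.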

Re: verifying that an index contains ONLY utf-8

2011-01-13 Thread Michael McCandless
The tokens that Lucene sees (pre-4.0) are char[] based (i.e., UTF-16), so the first place where invalid UTF-8 is detected/corrected/etc. is during your analysis process, which takes your raw content and produces char[] based tokens. Second, during indexing, Lucene ensures that the incoming char[]
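Because the tokens are UTF-16 `char` sequences, one check that can be made at that level is for unpaired surrogates, which have no valid UTF-8 encoding. The following is a minimal sketch under that assumption; it is not Lucene's own code, and the names `SurrogateCheck` and `firstUnpairedSurrogate` are hypothetical.

```java
// Hypothetical helper, not part of Lucene: scans UTF-16 text for
// surrogate code units that do not form a proper high/low pair.
final class SurrogateCheck {

    // Returns the index of the first unpaired surrogate, or -1 if none.
    static int firstUnpairedSurrogate(CharSequence text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isHighSurrogate(c)) {
                boolean paired = i + 1 < text.length()
                        && Character.isLowSurrogate(text.charAt(i + 1));
                if (!paired) {
                    return i; // high surrogate with no low surrogate after it
                }
                i++; // skip the low half of a well-formed pair
            } else if (Character.isLowSurrogate(c)) {
                return i; // low surrogate with no high surrogate before it
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstUnpairedSurrogate("plain text"));
        System.out.println(firstUnpairedSurrogate("a\uD83Db"));
    }
}
```

A check like this could be run on field values in a custom reindexing step; where exactly to hook it in is left open here.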

Re: verifying that an index contains ONLY utf-8

2011-01-13 Thread Robert Muir
On Thu, Jan 13, 2011 at 2:05 PM, Jonathan Rochkind rochk...@jhu.edu wrote:
> There are various packages of such heuristic algorithms to guess char encoding, I wouldn't try to write my own. icu4j might include such an algorithm, not sure.

it does:

Re: verifying that an index contains ONLY utf-8

2011-01-13 Thread Paul
Thanks for all the responses. CharsetDetector does look promising. Unfortunately, we aren't allowed to keep the original of much of our data, so the solr index is the only place it exists (to us). I do have a java app that reindexes, i.e., reads all documents out of one index, does some transform

RE: verifying that an index contains ONLY utf-8

2011-01-13 Thread Jonathan Rochkind
@lucene.apache.org Subject: Re: verifying that an index contains ONLY utf-8 Thanks for all the responses. CharsetDetector does look promising. Unfortunately, we aren't allowed to keep the original of much of our data, so the solr index is the only place it exists (to us). I do have a java app

Re: verifying that an index contains ONLY utf-8

2011-01-12 Thread Markus Jelsma
This is supposed to be dealt with outside the index. All input must be UTF-8 encoded. Failing to do so will give unexpected results. We've created an index from a number of different documents that are supplied by third parties. We want the index to only contain UTF-8 encoded characters. I

Re: verifying that an index contains ONLY utf-8

2011-01-12 Thread Peter Karich
converting on the fly is not supported by Solr but should be relatively easy in Java. Scanning is also relatively simple (accept only a range). Detection too: http://www.mozilla.org/projects/intl/chardet.html We've created an index from a number of different documents that are supplied by third