Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff

Markus Jelsma Mon, 27 Jun 2011 07:54:44 -0700


On Monday 27 June 2011 16:33:16 lee carroll wrote:
> Hi Markus
> 
> I've seen a similar issue before (but not with Solr) when processing files as
> XML. In our case the problem was due to processing a UTF-16 file with a
> byte order mark. This presents itself as
> 0xffff to the XML parser, which is not used by UTF-8 (the BOM
> would be represented as efbbbf in UTF-8). This caused the UTF-8
> aware parser to choke.
> 
> I don't want to get involved in any Unicode / UTF war as I'm confused
> enough as it stands, but
> could you check for UTF-16 files before processing?


Some files may be UTF-16, but I cannot confirm it right now. On the other hand,
Nutch should have no trouble processing UTF-16.
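
A quick way to check the suspected files for a byte order mark could look
like the sketch below. It is only an illustration in Python, not what Nutch
does internally; the names sniff_bom and _BOMS are made up for this example.

```python
import codecs

# Known byte order marks. The UTF-32 entries come first because the
# UTF-32 LE mark starts with the same two bytes as the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

Reading the first four bytes of each file and passing them to sniff_bom would
be enough to flag files that start with a UTF-16 or UTF-32 mark. A file
without a BOM can still be UTF-16, so a None result proves nothing.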

> 
> lee c
> 
> On 27 June 2011 14:26, Thomas Fischer <fischer...@aon.at> wrote:
> > Hello,
> > 
> > On 27.06.2011 at 12:40, Markus Jelsma wrote:
> >> Hi,
> >> 
> >> I came across the indexing error below. It happened in a huge batch
> >> update from Nutch with SolrJ 3.1. Since the crawl was huge, it is very
> >> hard to trace the error back to a specific document. So I'll try my luck
> >> here: has anyone seen this before with SolrJ 3.1? Is there anything else
> >> on the Nutch side I should have taken care of?
> >> 
> >> Thanks!
> >> 
> >> 
> >> Jun 27, 2011 10:24:28 AM org.apache.solr.core.SolrCore execute
> >> INFO: [] webapp=/solr path=/update params={wt=javabin&version=2} status=500 QTime=423
> >> Jun 27, 2011 10:24:28 AM org.apache.solr.common.SolrException log
> >> SEVERE: java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff at char #1142033, byte #1155068)
> >>       at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
> > 
> > and loads of other rubbish and
> > 
> >>       ... 26 more
> > 
> > I see this as a problem of Solr error reporting. This is not only
> > obnoxiously "loud" (white on grey with oversized fonts), but less useful
> > than it should be. Instead of telling the user where the error occurred
> > (i.e. while reading which file, which column at which line) it unravels
> > the stack. This is useless if the program just choked on some unexpected
> > input, like a typo in a schema or config file or an invalid character in
> > a file to be indexed. I don't know if this is due to Tomcat, the
> > logging system or Solr itself, but it is annoying.
> > 
> > And yes, I've seen something like this before and found the error not by
> > inspecting solr but by opening the suspected files with an appropriate
> > browser (e.g. Firefox) which tells me exactly where something goes
> > wrong.
> > 
> > All the best
> > Thomas
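
For tracing the offending document, one option is to scan the field text
before it goes into the update request. U+FFFF is a Unicode noncharacter and
falls outside the Char range that XML 1.0 allows, which would explain why the
Woodstox parser in the trace rejects it. The sketch below is in Python and
only illustrates the idea; the function names are made up here and this is
not taken from Nutch or SolrJ.

```python
def is_valid_xml_char(cp: int) -> bool:
    """True if the code point is allowed by the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def find_invalid_xml_chars(text: str):
    """Return (index, code point) pairs for characters XML 1.0 forbids."""
    return [
        (i, ord(ch))
        for i, ch in enumerate(text)
        if not is_valid_xml_char(ord(ch))
    ]


def strip_invalid_xml_chars(text: str) -> str:
    """Drop the forbidden characters and keep everything else unchanged."""
    return "".join(ch for ch in text if is_valid_xml_char(ord(ch)))
```

Logging the document URL together with the output of find_invalid_xml_chars
would point at the exact document and position, and strip_invalid_xml_chars
could be applied to each field value before the document is added to the
batch. Whether silently dropping such characters is acceptable depends on the
content being indexed.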

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
