On Monday 27 June 2011 16:33:16 lee carroll wrote:
> Hi Markus
>
> I've seen similar issue before (but not with solr) when processing files as
> xml. In our case the problem was due to processing a utf16 file with a
> byte order mark. This presents itself as
> 0xffff to the xml parser which is not used by utf8 (the bom unicode
> would be represented as efbfbf in utf8) This caused the utf8
> aware parser to choke.
>
> I don't want to get involved in any unicode / utf war as I'm confused
> enough as it stands but
> could you check for utf16 files before processing ?

Some files may be UTF-16 but i cannot confirm it right now. On the other hand,
Nutch should have no trouble processing UTF-16.

>
> lee c
>
> On 27 June 2011 14:26, Thomas Fischer <fischer...@aon.at> wrote:
> > Hello,
> >
> > On 27.06.2011 at 12:40, Markus Jelsma wrote:
> >> Hi,
> >>
> >> I came across the indexing error below. It happened in a huge batch
> >> update from Nutch with SolrJ 3.1. Since the crawl was huge it is very
> >> hard to trace the error back to a specific document. So i try my luck
> >> here: anyone seen this before with SolrJ 3.1? Anything else on the
> >> Nutch part i should have taken care of?
> >>
> >> Thanks!
> >>
> >>
> >> Jun 27, 2011 10:24:28 AM org.apache.solr.core.SolrCore execute
> >> INFO: [] webapp=/solr path=/update params={wt=javabin&version=2} status=500 QTime=423
> >> Jun 27, 2011 10:24:28 AM org.apache.solr.common.SolrException log
> >> SEVERE: java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff at char #1142033, byte #1155068)
> >>     at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
> >
> > and loads of other rubbish and
> >
> >> ... 26 more
> >
> > I see this as a problem of solr error-reporting. This is not only
> > obnoxiously "loud" (white on grey with oversized fonts), but less useful
> > than it should be. Instead of telling the user where the error occurred
> > (i.e. while reading which file, which column at which line) it unravels
> > the stack. This is useless if the program just choked on some unexpected
> > input, like a typo in a schema or config file or an invalid character in
> > a file to be indexed. I don't know if this is due to the Tomcat, the
> > logging system of solr itself, but it is annoying.
> >
> > And yes, I've seen something like this before and found the error not by
> > inspecting solr but by opening the suspected files with an appropriate
> > browser (e.g. Firefox) which tells me exactly where something goes
> > wrong.
> >
> > All the best
> > Thomas

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
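
For anyone hitting the same "Invalid UTF-8 character 0xffff" error, here is a minimal Python sketch (not from the thread) of the kind of pre-check suggested above, run over a document's raw bytes before it is sent to Solr. Some background it relies on: a UTF-16 byte order mark is the byte pair FF FE or FE FF, U+FEFF encoded as UTF-8 is EF BB BF, and the bytes EF BF BF decode to U+FFFF, a code point that the XML 1.0 character range excludes. The function names `sniff_bom` and `first_xml_problem` are hypothetical, and the sketch assumes the bytes of each document are available for inspection; it is not how Nutch or Solr validate input.

```python
import codecs
import re

# Byte order marks, longest first so UTF-32 LE is not mistaken for UTF-16 LE.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Anything outside the XML 1.0 "Char" production; this includes U+FFFE and U+FFFF.
_XML_ILLEGAL = re.compile(
    r"[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def first_xml_problem(data: bytes):
    """Describe the first UTF-8 or XML 1.0 character problem, or return None."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the byte offset of the first undecodable byte.
        return f"invalid UTF-8 at byte {exc.start}"
    match = _XML_ILLEGAL.search(text)
    if match:
        return f"U+{ord(match.group()):04X} not allowed in XML 1.0 at char {match.start()}"
    return None
```

Running both functions over each suspect file and logging the file name next to any non-None result gives the "which file, which position" information that the stack trace lacks.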