The fix is hardcoded to process only the `content` field, but it could be
changed to process any string field.
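
As a rough sketch of what such a generalization might look like (this is not the Nutch code: `FieldSanitizer`, `clean` and `cleanAll` are made-up names, and a plain `Map` stands in for the document's fields), the same kind of filtering could be applied to every string field before it is sent to Solr:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch: drops code points that an XML
// parser such as the one in the stack trace is likely to reject.
final class FieldSanitizer {

    // Returns true if the code point may be kept.
    static boolean isAllowed(int cp) {
        // Unicode noncharacters U+FDD0..U+FDEF.
        if (cp >= 0xFDD0 && cp <= 0xFDEF) return false;
        // The last two code points of every plane (U+FFFE, U+FFFF, U+1FFFE, ...).
        if ((cp & 0xFFFE) == 0xFFFE) return false;
        // Control characters below 0x20, except tab, line feed and carriage return.
        if (cp < 0x20 && cp != 0x9 && cp != 0xA && cp != 0xD) return false;
        return true;
    }

    // Strips disallowed code points from a single string value.
    static String clean(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            if (isAllowed(cp)) out.appendCodePoint(cp);
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    // Applies clean() to every field; the Map is an assumed stand-in
    // for whatever structure holds the document's string fields.
    static Map<String, String> cleanAll(Map<String, String> fields) {
        Map<String, String> cleaned = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            cleaned.put(e.getKey(), clean(e.getValue()));
        }
        return cleaned;
    }
}
```

With something like this, a custom field such as `fullcontent` would get the same treatment as `content`. Lone surrogates are not handled here and might need separate care.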
 
-----Original message-----
> From:Stefan Scheffler <[email protected]>
> Sent: Mon 24-Sep-2012 12:27
> To: [email protected]
> Subject: Re: Indexing Exception
> 
> Hey,
> Thank you. I used this method in the meantime and it worked fine.
> Is there a general way to apply the UTF-8 encoding to this field in
> Nutch as well?
> 
> On 24.09.2012 12:04, Markus Jelsma wrote:
> > Hi Stefan,
> >
> > You can take the stripNonCharCodepoints() method and pass your content 
> > through it. It should fix the problem.
> >
> > Cheers,
> >   
> > -----Original message-----
> >> From:Stefan Scheffler <[email protected]>
> >> Sent: Mon 24-Sep-2012 11:23
> >> To: [email protected]
> >> Subject: Re: Indexing Exception
> >>
> >> Hey Markus, you gave me the right hint.
> >> In addition to the normal content field I added a field, fullcontent,
> >> which simply holds the HTML document of the relevant content field,
> >> because we need it in a later processing step. This field is not
> >> encoded like the content field. I implemented this with my own
> >> ParsingFilter, which stores it in the ParseResult, and then an
> >> IndexingFilter merges it into the NutchDocument.
> >>
> >> Is there a better way to do this, or a way to apply the same encoding
> >> to fullcontent as to content?
> >>
> >> Regards
> >> Stefan
> >> On 24.09.2012 10:41, Markus Jelsma wrote:
> >>> It was fixed for the content field with NUTCH-1016. Can you pinpoint the
> >>> problematic field?
> >>> https://issues.apache.org/jira/browse/NUTCH-1016
> >>>
> >>>    
> >>>    
> >>> -----Original message-----
> >>>> From:Stefan Scheffler <[email protected]>
> >>>> Sent: Mon 24-Sep-2012 10:37
> >>>> To: [email protected]
> >>>> Subject: Re: Indexing Exception
> >>>>
> >>>> nutch 1.5, solr 3.6
> >>>> On 24.09.2012 10:34, Markus Jelsma wrote:
> >>>>> Hi - What version?
> >>>>>
> >>>>>     
> >>>>>     
> >>>>> -----Original message-----
> >>>>>> From:Stefan Scheffler <[email protected]>
> >>>>>> Sent: Mon 24-Sep-2012 10:29
> >>>>>> To: [email protected]
> >>>>>> Subject: Indexing Exception
> >>>>>>
> >>>>>> Hello,
> >>>>>> I have a strange problem. While indexing a crawl to Solr I got the
> >>>>>> following exception:
> >>>>>>
> >>>>>> java.lang.RuntimeException: [was class java.io.CharConversionException]
> >>>>>> Invalid UTF-8 character 0xfffe at char #6886708, byte #11578429)
> >>>>>>         at
> >>>>>> com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
> >>>>>>         at
> >>>>>> com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
> >>>>>>         at
> >>>>>> com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
> >>>>>>         at
> >>>>>> com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
> >>>>>>         at 
> >>>>>> org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:315)
> >>>>>>         at 
> >>>>>> org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:156)
> >>>>>>         at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:79)
> >>>>>>         at
> >>>>>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:58)
> >>>>>>         at
> >>>>>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
> >>>>>>         at org.apache.solr.core.SolrCore.execute(SolrCore.java:1376)
> >>>>>>         at
> >>>>>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:365)
> >>>>>>         at
> >>>>>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:260)
> >>>>>>         at
> >>>>>> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> >>>>>>         at
> >>>>>> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
> >>>>>>         at
> >>>>>> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> >>>>>>         at
> >>>>>> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
> >>>>>>         at
> >>>>>> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
> >>>>>>         at
> >>>>>> org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
> >>>>>>         at
> >>>>>> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
> >>>>>>         at
> >>>>>> org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
> >>>>>>         at
> >>>>>> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
> >>>>>>         at org.mortbay.jetty.Server.handle(Server.java:326)
> >>>>>>         at
> >>>>>> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
> >>>>>>         at
> >>>>>> org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
> >>>>>>         at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:843)
> >>>>>>         at 
> >>>>>> org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
> >>>>>>         at 
> >>>>>> org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
> >>>>>>         at
> >>>>>> org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
> >>>>>>         at
> >>>>>> org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
> >>>>>> Caused by: java.io.CharConversionException: Invalid UTF-8 character
> >>>>>> 0xfffe at char #6886708, byte #11578429)
> >>>>>> ...
> >>>>>>
> >>>>>> It seems to be an encoding exception. Is there a way to avoid this?
> >>>>>>
> >>>>>> Regards
> >>>>>> Stefan
> >>>>>>
> >>>>>> -- 
> >>>>>> Stefan Scheffler
> >>>>>> Avantgarde Labs GmbH
> >>>>>> Löbauer Straße 19, 01099 Dresden
> >>>>>> Telefon: + 49 (0) 351 21590834
> >>>>>> Email: [email protected]
> >>>>>>
> >>>>>>
