Re: Solr xml img parsing exception

2013-11-15 Thread Marcello Lorenzi

Hi Jack,
we have analyzed the issue and found duplicate Tika jars on the Tomcat 
classpath. After removing the duplicate library, the search engine now 
works as expected.
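
One way to confirm a duplicate like this is to ask the class loader for every location that provides a given class. The following is a minimal sketch, not part of Solr or Tika: the class name DuplicateClassFinder and its locations method are hypothetical, and the default class name is taken from the stack trace in this thread. Under Tomcat it would have to run inside the web application (for example from a servlet) so that the webapp class loader is the one being queried.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration; not part of Solr, Tika, or Tomcat.
final class DuplicateClassFinder {

    // Returns every location visible to the loader that provides the class.
    // More than one entry means the class is present on the classpath twice.
    static List<URL> locations(ClassLoader loader, String className) {
        String resource = className.replace('.', '/') + ".class";
        try {
            return Collections.list(loader.getResources(resource));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Default taken from the stack trace; pass another class name as args[0].
        String className = args.length > 0 ? args[0] : "org.apache.tika.parser.xml.XMLParser";
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        List<URL> found = locations(loader, className);
        for (URL url : found) {
            System.out.println(url);
        }
        if (found.size() > 1) {
            System.out.println("WARNING: " + className + " is provided by " + found.size() + " locations");
        }
    }
}
```

Each printed URL names the jar that supplies the class, so two different jar paths for the same class point at the duplicate to remove.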


Thanks for the support,
Marcello

On 11/14/2013 05:24 PM, Jack Krupansky wrote:

The actual error appears to be:

Caused by: org.xml.sax.SAXParseException; lineNumber: 91; columnNumber:
105; The element type "img" must be terminated by the matching end-tag
"</img>".

So, check the input document at line 91, column 105. There should be 
an <img> tag there, but SAX is complaining that there is no matching 
</img>.


-- Jack Krupansky


Solr xml img parsing exception

2013-11-14 Thread Marcello Lorenzi

Hi,
I have installed a Solr 4.3 instance and configured ManifoldCF to pass 
web content to the shard collection, but during crawling we have 
noticed many occurrences of this exception:


ERROR - 2013-11-14 15:13:57.954; org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: XML parse error
    at com.lsegroup.solr.handler.CwsExtractingDocumentLoader.load(CwsExtractingDocumentLoader.java:150)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:242)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1816)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:221)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:107)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:155)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:76)
    at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:934)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:90)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:515)
    at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1012)
    at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:642)
    at org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(Http11NioProtocol.java:223)
    at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1597)
    at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:1555)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)
Caused by: org.apache.tika.exception.TikaException: XML parse error
    at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:78)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at com.lsegroup.solr.handler.CwsExtractingDocumentLoader.load(CwsExtractingDocumentLoader.java:147)
    ... 24 more
Caused by: org.xml.sax.SAXParseException; lineNumber: 91; columnNumber: 105; The element type "img" must be terminated by the matching end-tag "</img>".
    at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:198)
    at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:177)
    at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:441)
    at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:368)
    at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1388)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1753)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2951)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:116)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:846)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:775)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:123)
    at

Re: Solr xml img parsing exception

2013-11-14 Thread Erick Erickson
It looks like bad data. The XML you're sending to Solr looks malformed, so
I suspect this is completely outside of Solr's purview.
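
To illustrate what "malformed" means for the error in this thread: an HTML-style <img> tag with no closing tag is not well-formed XML, while the self-closed form is. The sketch below uses only the JDK's SAX parser; the class name WellFormedCheck and the sample strings are made up for illustration and are not taken from the crawled content.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; not part of Solr or Tika.
final class WellFormedCheck {

    // Returns true if the text parses as well-formed XML, false if the
    // SAX parser reports an error such as an unterminated element.
    static boolean isWellFormedXml(String text) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(text)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Illustrative inputs, not from the thread's data.
        String htmlStyle = "<p><img src=\"a.png\"></p>";
        String xmlStyle = "<p><img src=\"a.png\"/></p>";
        System.out.println("HTML-style img well-formed: " + isWellFormedXml(htmlStyle));
        System.out.println("Self-closed img well-formed: " + isWellFormedXml(xmlStyle));
    }
}
```

Running a rejected document through a check like this before indexing can show whether the content itself is at fault or whether something else in the pipeline is choosing the wrong parser.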

Best,
Erick



Re: Solr xml img parsing exception

2013-11-14 Thread Erik Hatcher
Also there's a custom loader here that is the culprit:  
com.lsegroup.solr.handler.CwsExtractingDocumentLoader

On Nov 14, 2013, at 10:20, Erick Erickson erickerick...@gmail.com wrote:

 It looks like bad data. The XML you're sending to Solr looks malformed, so
 I suspect this is completely outside of Solr's purview.
 
 Best,
 Erick
 
 

Re: Solr xml img parsing exception

2013-11-14 Thread Marcello Lorenzi

Hi Erik,
but in this case, does the custom loader receive an HTTP 500 error from Solr?

Thanks,
Marcello
On 11/14/2013 04:29 PM, Erik Hatcher wrote:

Also there's a custom loader here that is the culprit:  
com.lsegroup.solr.handler.CwsExtractingDocumentLoader

On Nov 14, 2013, at 10:20, Erick Erickson erickerick...@gmail.com wrote:


It looks like bad data. The XML you're sending to Solr looks malformed, so
I suspect this is completely outside of Solr's purview.

Best,
Erick



Re: Solr xml img parsing exception

2013-11-14 Thread Jack Krupansky

The actual error appears to be:

Caused by: org.xml.sax.SAXParseException; lineNumber: 91; columnNumber:
105; The element type "img" must be terminated by the matching end-tag
"</img>".

So, check the input document at line 91, column 105. There should be an 
<img> tag there, but SAX is complaining that there is no matching </img>.
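
As a rough sketch of how to look at that position, the helper below prints the reported line with a caret under the reported column. The class name ErrorContext and the sample document are hypothetical. Line and column are treated as 1-based, matching what SAXParseException reports; parsers generally report where the problem was detected, which can be slightly after the tag that caused it.

```java
// Hypothetical helper for illustration; not part of Solr or Tika.
final class ErrorContext {

    // Returns the requested line followed by a caret line marking the column.
    // Both line and column are 1-based.
    static String describe(String document, int line, int column) {
        String[] lines = document.split("\\r?\\n", -1);
        if (line < 1 || line > lines.length) {
            return "line " + line + " is outside the document";
        }
        String text = lines[line - 1];
        // Clamp the caret so an out-of-range column still points at the line end.
        int caret = Math.max(0, Math.min(column - 1, text.length()));
        StringBuilder sb = new StringBuilder(text).append('\n');
        for (int i = 0; i < caret; i++) {
            sb.append(' ');
        }
        return sb.append('^').toString();
    }

    public static void main(String[] args) {
        // Illustrative two-line document; the real input would be the crawled page.
        String doc = "<a>\n<img src='x'></a>";
        System.out.println(describe(doc, 2, 16));
    }
}
```

For the error in this thread the call would use line 91 and column 105 against the document that ManifoldCF sent.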


-- Jack Krupansky
