Looks like a Solr/Tika issue with JPEG file metadata extraction:

https://issues.apache.org/jira/browse/SOLR-4645

The JIRA issue contains a workaround that looks reasonable, though I should 
note that I haven't tried it.
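
To check whether the class is simply missing on your installation, something like the following Python sketch could list which jars under a given directory actually contain it. The function name and the directory you pass in are my own placeholders, not anything from Solr or the JIRA issue, and I'm assuming the class would normally be found in a jar somewhere under a lib directory.

```python
import zipfile
from pathlib import Path


def jars_containing(class_name, lib_dir):
    """Return the jar files under lib_dir that contain the given Java class."""
    # A class com.example.Foo is stored in a jar as com/example/Foo.class
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for jar in sorted(Path(lib_dir).rglob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as zf:
                if entry in zf.namelist():
                    hits.append(jar)
        except zipfile.BadZipFile:
            continue  # not a readable zip archive; skip it
    return hits


# Example call (the directory is hypothetical):
# jars_containing("com.adobe.xmp.XMPException", "/path/to/solr/lib")
```

An empty result would be consistent with the NoClassDefFoundError in the log.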

Adrian

From: Ronny Heylen [mailto:[email protected]]
Sent: 29 October 2013 15:35
To: Karl Wright; Adrian Conlon
Cc: [email protected]
Subject: Re: Error in Manifoldcf, what's the first step?

The help on file size was great, but we still have a problem with small JPEG files.
solr.log contains:

ERROR - 2013-10-29 15:47:19.815; org.apache.solr.common.SolrException; null:java.lang.RuntimeException: java.lang.NoClassDefFoundError: com/adobe/xmp/XMPException
    at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:673)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:383)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:158)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:222)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:99)
    at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:953)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
    at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1023)
    at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:589)
    at org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.run(AprEndpoint.java:1852)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.NoClassDefFoundError: com/adobe/xmp/XMPException
    at com.drew.imaging.jpeg.JpegMetadataReader.extractMetadataFromJpegSegmentReader(JpegMetadataReader.java:112)
    at com.drew.imaging.jpeg.JpegMetadataReader.readMetadata(JpegMetadataReader.java:71)
    at org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:91)
    at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1904)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:659)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:362)
    ... 16 more
Caused by: java.lang.ClassNotFoundException: com.adobe.xmp.XMPException
    at java.net.URLClassLoader$1.run(Unknown Source)
    at java.net.URLClassLoader$1.run(Unknown Source)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at java.net.FactoryURLClassLoader.loadClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    ... 30 more
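
In a trace this long, the last "Caused by:" line is usually the most useful one to search for. A minimal Python sketch for pulling it out of a log excerpt follows; the function name is hypothetical and the pattern only assumes the standard Java "Caused by: <class>: <message>" layout seen above.

```python
import re

# Matches lines such as "Caused by: java.lang.Foo: some message"
CAUSED_BY = re.compile(r"^Caused by: ([\w.$]+)(?:: (.*))?\r?$", re.MULTILINE)


def root_cause(trace):
    """Return (exception_class, message) from the last 'Caused by:' line, or None."""
    last = None
    for match in CAUSED_BY.finditer(trace):
        # The message part is optional, so fall back to an empty string
        last = (match.group(1), (match.group(2) or "").strip())
    return last
```

Applied to the excerpt above, this should point at the ClassNotFoundException for com.adobe.xmp.XMPException rather than at the outer RuntimeException.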

On Tue, Oct 29, 2013 at 1:25 PM, Ronny Heylen 
<[email protected]<mailto:[email protected]>> wrote:
That was a very good suggestion!
Setting the max size has solved the problem for the first subfolder on which we 
test.
Now we will retry on the full drive and let you know the result.

On Tue, Oct 29, 2013 at 12:12 PM, Karl Wright 
<[email protected]<mailto:[email protected]>> wrote:
Based on the error message, Adrian is correct and this is once again a 
Solr-side problem.  Since Solr puts all documents into memory, my guess is that 
you are attempting to index some very large documents and those are causing 
Solr to run out of memory.  Either exclude these from the crawl or set a 
reasonable maximum length.
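
To get a feel for which documents might be involved, a rough Python sketch like this could list the files above a chosen size. The function name, the share path and the 50 MB cut-off are placeholders I made up for illustration; pick a limit that suits the memory you have available.

```python
import os


def files_over(root, max_bytes):
    """Yield (path, size) for every file under root larger than max_bytes."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # unreadable or vanished file; skip it
            if size > max_bytes:
                yield path, size


# Example with a hypothetical share path and a 50 MB cut-off:
# for path, size in files_over(r"\\server\share", 50 * 1024 * 1024):
#     print(size, path)
```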

Karl

________________________________
From: Ronny Heylen
Sent: 10/29/2013 6:52 AM

To: [email protected]<mailto:[email protected]>
Subject: Error in Manifoldcf, what's the first step?
Hi,

We are running Solr 4.4 and ManifoldCF 1.3.

We are indexing a shared Windows network drive, filtering on *.doc*, *.xls*, 
*.pdf ... with about 650,000 files to index, giving a Solr index 35GB in size.

The result is great except that the manifoldcf job crashes before the end.

Note that:
- ignoreTikaException is true in solrconfig.xml (otherwise the ManifoldCF job 
stops very early).
- Tomcat has been given 24 GB of memory (it uses 15GB)
- there are 8 cores

The message in http://localhost:8080/mcf-crawler-ui/showjobstatus.jsp is:

Error: Repeated service interruptions - failure processing document: Server at 
http://localhost:8080/solr/collection1 returned non ok status:500, 
message:Internal Server Error

Then, instead of indexing the full drive in one job, we defined one job 
for each subfolder.
Almost all "subfolder" jobs end successfully; only 2 or 3 give the 
same message, and 2 or 3 others give a different message:

Error: Repeated service interruptions - failure processing document: Read timed 
out

If we go further (defining one job for each subfolder of a subfolder in 
error), the same happens: success for almost all subfolders except 1 or 2.

What is the first step to take to solve this problem?
Thanks.

