i run tika 2.8.0.
it's used for attachment scanning for a dovecot imap server.
it runs on a separate server from dovecot, on the same lan.
it's up & running:
ps ax | grep tika
63506 ? Ssl 0:00 /usr/bin/java
-Dpdfbox.fontcache=/var/tika -XX:ParallelGCThreads=1 -XX:CICompilerCount=2
-XX:-CICompilerCountPerCPU -jar /srv/apps/tika/tika-server.jar -c
/usr/local/etc/tika/tika-server-config-custom.xml --host 10.1.7.100 --port 9998
63540 ? Sl 0:02 /usr/bin/java -Xms1g -Xmx1g
-Dpdfbox.fontcache=/var/tika -Dlog4j2.warn -Djava.awt.headless=true -cp
/srv/apps/tika/tika-server.jar -Dtika.server.id=
org.apache.tika.server.core.TikaServerProcess -h 10.1.7.100 -p 9998 -i -c
/usr/local/etc/tika/tika-server-config-custom.xml -forkedStatusFile
/tmp/apache-tika-server-forked-tmp-15836749653669077604 -numRestarts 0
the dovecot config for using the tika instance is
fts_tika = http://10.1.7.100:9998/tika/
testing a local PDF directly against the tika server's /meta endpoint,
F="/tmp/TEST.pdf"
/bin/cp -af "$F" /tmp/test.pdf
chown vmail:vmail /tmp/test.pdf
curl \
-T /tmp/test.pdf \
http://10.1.7.100:9998/meta
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core
Test.SNAPSHOT">
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""
xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
xmlns:xmp="http://ns.adobe.com/xap/1.0/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"
xmlns:xmpTPg="http://ns.adobe.com/xap/1.0/t/pg/"
pdf:PDFVersion="1.4"
pdf:hasXFA="false"
pdf:num3DAnnotations="0"
pdf:overallPercentageUnmappedUnicodeChars="0.0"
pdf:hasCollection="false"
pdf:encrypted="false"
pdf:containsNonEmbeddedFont="false"
pdf:hasMarkedContent="true"
pdf:producer="Adobe PDF Library 15.0"
pdf:totalUnmappedUnicodeChars="0"
pdf:hasXMP="true"
pdf:containsDamagedFont="false"
xmp:CreatorTool="Adobe InDesign 15.1 (Macintosh)"
dc:format="application/pdf; version=1.4"
dc:language="en-US"
xmpMM:DocumentID="xmp.id:8a612346-9d03-4caf-8ebf-da6f3716ed0a"
xmpTPg:NPages="14">
<pdf:unmappedUnicodeCharsPerPage>
<rdf:Seq>
<rdf:li>0</rdf:li>
<rdf:li>0</rdf:li>
<rdf:li>0</rdf:li>
<rdf:li>0</rdf:li>
<rdf:li>0</rdf:li>
<rdf:li>0</rdf:li>
<rdf:li>0</rdf:li>
<rdf:li>0</rdf:li>
<rdf:li>0</rdf:li>
<rdf:li>0</rdf:li>
<rdf:li>0</rdf:li>
<rdf:li>0</rdf:li>
<rdf:li>0</rdf:li>
<rdf:li>0</rdf:li>
</rdf:Seq>
</pdf:unmappedUnicodeCharsPerPage>
<pdf:charsPerPage>
<rdf:Seq>
<rdf:li>84</rdf:li>
<rdf:li>676</rdf:li>
<rdf:li>1653</rdf:li>
<rdf:li>1914</rdf:li>
<rdf:li>814</rdf:li>
<rdf:li>1022</rdf:li>
<rdf:li>645</rdf:li>
<rdf:li>1221</rdf:li>
<rdf:li>1087</rdf:li>
<rdf:li>732</rdf:li>
<rdf:li>887</rdf:li>
<rdf:li>1295</rdf:li>
<rdf:li>1263</rdf:li>
<rdf:li>149</rdf:li>
</rdf:Seq>
</pdf:charsPerPage>
<pdf:annotationTypes>
<rdf:Bag>
<rdf:li>null</rdf:li>
</rdf:Bag>
</pdf:annotationTypes>
<pdf:annotationSubtypes>
<rdf:Bag>
<rdf:li>Link</rdf:li>
</rdf:Bag>
</pdf:annotationSubtypes>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
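the check above hits /meta, while the fts_tika URL points dovecot at /tika/, so a closer comparison would be to request text extraction from /tika directly. a minimal sketch, assuming the tika-server /tika endpoint is reachable at the same host and port; the helper names tika_url and tika_text and the TIKA_BASE variable are hypothetical, not part of tika or dovecot. it uses --data-binary rather than -T, because curl -T appends the local file name to a URL that ends in a slash.

```shell
#!/bin/sh
# Hypothetical helper: join a base URL and an endpoint with exactly one slash.
tika_url() {
    base="${1%/}"                       # drop one trailing slash, if present
    printf '%s/%s\n' "$base" "${2#/}"   # drop one leading slash from the endpoint
}

# Hypothetical helper: PUT a file to /tika and print the extracted plain text.
# TIKA_BASE is an assumed variable; it defaults to the host and port from this post.
tika_text() {
    file="$1"
    ctype="${2:-application/octet-stream}"
    curl -sS -X PUT \
        -H "Content-Type: $ctype" \
        -H "Accept: text/plain" \
        --data-binary "@$file" \
        "$(tika_url "${TIKA_BASE:-http://10.1.7.100:9998}" tika)"
}

# Usage (not run here):
#   tika_text /tmp/test.pdf application/pdf
```

if this returns text for the PDF, the same call with a saved .eml and message/rfc822 as the content type would show whether the endpoint itself handles mail.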
passing/processing an email with a *.pdf attachment from dovecot logs ok,
Jul 11 08:12:50 svr003 tika[63540]: INFO [qtp1164394344-41]
09:12:50,042 org.apache.tika.server.core.TikaLoggingFilter Request URI:
http://10.1.7.100:9998/tika/
Jul 11 08:12:50 svr003 tika[63540]: INFO [qtp1164394344-41]
09:12:50,043 org.apache.tika.server.core.resource.TikaResource /tika
(application/pdf)
and results are passed back to dovecot, and the scan/index db is updated accordingly.
but passing/processing an email with an embedded (forwarded as attachment)
*.eml logs the following 'SEVERE' error,
Jul 11 08:36:49 svr003 tika[62540]: INFO [qtp1164241227-41]
08:36:49,417 org.apache.tika.server.core.TikaLoggingFilter Request URI:
http://10.1.7.100:9998/tika/
Jul 11 08:36:49 svr003 tika[62540]: INFO [qtp1164241227-41]
08:36:49,418 org.apache.tika.server.core.resource.TikaResource /tika
(message/rfc822)
Jul 11 08:36:49 svr003 tika[62540]: WARN [qtp1164241227-41]
08:36:49,419 org.apache.tika.server.core.resource.TikaResource tika/: Text
extraction failed ([0-9961000034519].eml)
Jul 11 08:36:49 svr003 tika[62540]:
org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
Jul 11 08:36:49 svr003 tika[62540]: at
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:185)
~[tika-server-standard-2.8.0.jar:2.8.0]
Jul 11 08:36:49 svr003 tika[62540]: at
org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:152)
~[tika-server-standard-2.8.0.jar:2.8.0]
Jul 11 08:36:49 svr003 tika[62540]: at
org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:57)
~[tika-server-standard-2.8.0.jar:2.8.0]
Jul 11 08:36:49 svr003 tika[62540]: at
org.apache.tika.server.core.resource.TikaResource.parse(TikaResource.java:357)
~[tika-server-standard-2.8.0.jar:2.8.0]
Jul 11 08:36:49 svr003 tika[62540]: at
org.apache.tika.server.core.resource.TikaResource.lambda$produceText$1(TikaResource.java:507)
~[tika-server-standard-2.8.0.jar:2.8.0]
Jul 11 08:36:49 svr003 tika[62540]: at
org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:177)
~[tika-server-standard-2.8.0.jar:2.8.0]
Jul 11 08:36:49 svr003 tika[62540]: at
org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1651)
~[tika-server-standard-2.8.0.jar:2.8.0]
Jul 11 08:36:49 svr003 tika[62540]: at
org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:249)
~[tika-server-standard-2.8.0.jar:2.8.0]
Jul 11 08:36:49 svr003 tika[62540]: at
org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:122)
~[tika-server-standard-2.8.0.jar:2.8.0]
Jul 11 08:36:49 svr003 tika[62540]: at
org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:84)
~[tika-server-standard-2.8.0.jar:2.8.0]
Jul 11 08:36:49 svr003 tika[62540]: at
org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
~[tika-server-standard-2.8.0.jar:2.8.0]
Jul 11 08:36:49 svr003 tika[62540]: at
org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:90)
~[tika-server-standard-2.8.0.jar:2.8.0]
Jul 11 08:36:49 svr003 tika[62540]: at
org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
~[tika-server-standard-2.8.0.jar:2.8.0]
Jul 11 08:36:49 svr003 tika[62540]: at
org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
~[tika-server-standard-2.8.0.jar:2.8.0]
Jul 11 08:36:49 svr003 tika[62540]: at
org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:265)
~[tika-server-standard-2.8.0.jar:2.8.0]
Jul 11 08:36:49 svr003 tika[62540]: at
org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
~[tika-server-standard-2.8.0.jar:2.8.0]
Jul 11 08:36:49 svr003 tika[62540]: at
org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
~[tika-server-standard-2.8.0.jar:2.8.0]
Jul 11 08:36:49 svr003 tika[62540]: at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
~[tika-server-standard-2.8.0.jar:2.8.0]
Jul 11 08:36:49 svr003 tika[62540]: at
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
~[tika-server-standard-2.8.0.jar:2.8.0]
Jul 11 08:36:49 svr003 tika[62540]: at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440)
~[tika-server-standard-2.8.0.jar:2.8.0]
Jul 11 08:36:49 svr003 tika[62540]: at
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190)
~[tika-server-standard-2.8.0.jar:2.8.0]
Jul 11 08:36:49 svr003 tika[62540]: at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355)
~[tika-server-standard-2.8.0.jar:2.8.0]
Jul 11 08:36:49 svr003 tika[62540]: at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
~[tika-server-standard-2.8.0.jar:2.8.0]
Jul 11 08:36:49 svr003 tika[62540]: at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191)
~[tika-server-standard-2.8.0.jar:2.8.0]
Jul 11 08:36:49 svr003 tika[62540]: at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
~[tika-server-standard-2.8.0.jar:2.8.0]
Jul 11 08:36:49 svr003 tika[62540]: at
org.eclipse.jetty.server.Server.handle(Server.java:516)
~[tika-server-standard-2.8.0.jar:2.8.0]
Jul 11 08:36:49 svr003 tika[62540]: at
org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487)
~[tika-server-standard-2.8.0.jar:2.8.0]
Jul 11 08:36:49 svr003 tika[62540]: at
org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732)
[tika-server-standard-2.8.0.jar:2.8.0]
Jul 11 08:36:49 svr003 tika[62540]: at
org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479)
[tika-server-standard-2.8.0.jar:2.8.0]
Jul 11 08:36:49 svr003 tika[62540]: at
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
[tika-server-standard-2.8.0.jar:2.8.0]
Jul 11 08:36:49 svr003 tika[62540]: at
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
[tika-server-standard-2.8.0.jar:2.8.0]
Jul 11 08:36:49 svr003 tika[62540]: at
org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
[tika-server-standard-2.8.0.jar:2.8.0]
Jul 11 08:36:49 svr003 tika[62540]: at
org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
[tika-server-standard-2.8.0.jar:2.8.0]
Jul 11 08:36:49 svr003 tika[62540]: at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
[tika-server-standard-2.8.0.jar:2.8.0]
Jul 11 08:36:49 svr003 tika[62540]: at
org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
[tika-server-standard-2.8.0.jar:2.8.0]
Jul 11 08:36:49 svr003 tika[62540]: at
java.lang.Thread.run(Thread.java:833) [?:?]
Jul 11 08:36:49 svr003 tika[62540]: Jul 11, 2023 8:36:49 AM
org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
Jul 11 08:36:49 svr003 tika[62540]: SEVERE: Problem with writing the
data, class
org.apache.tika.server.core.resource.TikaResource$$Lambda$371/0x00000008012ab9e0,
ContentType: text/plain
iiuc, .eml should be parseable
https://tika.apache.org/2.8.0/formats.html#Mail_formats
https://tika.apache.org/2.8.0/api/org/apache/tika/parser/mail/RFC822Parser.html
is additional/different config needed for .eml processing?
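
the ZeroByteFileException in the log suggests the request body was empty by the time it reached tika, which may point at what was sent rather than at the RFC822 parser. one way to narrow that down outside dovecot is to send a saved copy of the forwarded message by hand. a minimal sketch, assuming a local .eml copy exists; the function name put_eml and its return code 2 are hypothetical, and the default URL is taken from the fts_tika setting above without the trailing slash.

```shell
#!/bin/sh
# Hypothetical helper: send a saved .eml to tika, refusing empty input.
# Returns 2 for a missing or zero-byte file, otherwise curl's exit status.
put_eml() {
    eml="$1"
    url="${2:-http://10.1.7.100:9998/tika}"
    # -s is true only if the file exists and has a size greater than zero
    if [ ! -s "$eml" ]; then
        echo "skip: $eml is missing or empty" >&2
        return 2
    fi
    curl -sS -X PUT \
        -H "Content-Type: message/rfc822" \
        -H "Accept: text/plain" \
        --data-binary "@$eml" \
        "$url"
}

# Usage (not run here):
#   put_eml /tmp/forwarded.eml
```

if a non-empty .eml extracts cleanly this way, the empty stream is more likely coming from the request dovecot makes for that message part than from tika's config.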