On 7/17/22 11:52 AM, Tilman Hausherr wrote:
> https://issues.apache.org/jira/browse/TIKA-3819
> This will show filename and length but only if logging is in DEBUG log level.
> The modified version will appear at
> https://repository.apache.org/content/groups/snapshots/org/apache/tika/
> in a few hours.
thx o/
Checking
https://issues.apache.org/jira/browse/TIKA-3819
I see
Fix Version/s: 2.4.2
https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/697/
Build #697 (Jul 17, 2022, 3:47:56 PM)
I installed
tika-server-standard-2.4.2-20220717.154907-90.jar,
set the following (changed lines marked with !),
cat tika-server-config-custom.xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
<server>
<params>
! <logLevel>debug</logLevel>
...
<forkedJvmArgs>
...
! <arg>-Dlog4j2.debug</arg>
...
and launched,
systemctl status tika -l
● tika.service - Apache Tika server
Loaded: loaded (/etc/systemd/system/tika.service; enabled; vendor preset: disabled)
Active: active (running) since Sun 2022-07-17 20:51:36 EDT; 5min ago
Main PID: 25001 (java)
Tasks: 54 (limit: 8811)
Memory: 208.3M
CPU: 31.115s
CGroup: /system.slice/tika.service
├─ 25001 /usr/bin/java -jar /srv/tika/tika-server.jar -c /etc/tika/tika-server-config-custom.xml
└─ 25039 /usr/bin/java -Xms1g -Xmx1g -Dpdfbox.fontcache=/var/tika -Dlog4j2.debug -Djava.awt.headless=true -cp /srv/tika/tika-server.jar -Dtika.server.id= org.apache.tika.server.core.TikaServerProcess -h 127.0.0.1 -p 9998 -i "" -c /etc/tika/tika-server-config-custom.xml -forkedStatusFile /tmp/apache-tika-server-forked-tmp-8013562591697588923 -numRestarts 0
Jul 17 20:52:15 mx-test tika[25039]: at
org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:198)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:52:15 mx-test tika[25039]: at
org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:226)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:52:15 mx-test tika[25039]: at
org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1230)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:52:15 mx-test tika[25039]: at
org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1204)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:52:15 mx-test tika[25039]: at
org.apache.tika.parser.pdf.PDFParser.getPDDocument(PDFParser.java:291)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:52:15 mx-test tika[25039]: at
org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:178)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:52:15 mx-test tika[25039]: at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:52:15 mx-test tika[25039]: ... 37 more
Jul 17 20:52:15 mx-test tika[25039]: ERROR [qtp1401737458-25]
20:52:15,597 org.apache.cxf.jaxrs.utils.JAXRSUtils Problem with writing the
data, class
org.apache.tika.server.core.resource.TikaResource$$Lambda$344/0x0000000800eb2e78,
ContentType: text/plain
Jul 17 20:52:15 mx-test tika[25039]: TRACE StatusLogger
Log4jLoggerFactory.getContext() found anchor class
org.apache.cxf.common.logging.Slf4jLogger
On receipt of an email with a PDF attachment, it FAILs as before:
journalctl -f -u tika
Jul 17 20:59:42 mx-test tika[25039]: INFO [qtp1401737458-25]
20:59:42,066 org.apache.tika.server.core.resource.TikaResource /tika
(application/pdf)
Jul 17 20:59:42 mx-test tika[25039]: WARN [qtp1401737458-25]
20:59:42,243 org.apache.pdfbox.pdfparser.COSParser The end of the stream
doesn't point to the correct offset, using workaround to read the stream,
stream start position: 104319, length: 366, expected end position: 104685
Jul 17 20:59:42 mx-test tika[25039]: ERROR [qtp1401737458-25]
20:59:42,245 org.apache.pdfbox.filter.FlateFilter FlateFilter: stop reading
corrupt stream due to a DataFormatException
Jul 17 20:59:42 mx-test tika[25039]: WARN [qtp1401737458-25]
20:59:42,467 org.apache.pdfbox.pdfparser.COSParser The end of the stream
doesn't point to the correct offset, using workaround to read the stream,
stream start position: 101704, length: 1475, expected end position: 103179
Jul 17 20:59:42 mx-test tika[25039]: ERROR [qtp1401737458-25]
20:59:42,469 org.apache.pdfbox.filter.FlateFilter FlateFilter: stop reading
corrupt stream due to a DataFormatException
Jul 17 20:59:42 mx-test tika[25039]: WARN [qtp1401737458-25]
20:59:42,481 org.apache.pdfbox.pdfparser.COSParser The end of the stream
doesn't point to the correct offset, using workaround to read the stream,
stream start position: 101514, length: 66, expected end position: 101580
Jul 17 20:59:42 mx-test tika[25039]: ERROR [qtp1401737458-25]
20:59:42,482 org.apache.pdfbox.filter.FlateFilter FlateFilter: stop reading
corrupt stream due to a DataFormatException
Jul 17 20:59:42 mx-test tika[25039]: WARN [qtp1401737458-25]
20:59:42,493 org.apache.pdfbox.pdfparser.COSParser The end of the stream
doesn't point to the correct offset, using workaround to read the stream,
stream start position: 2011, length: 2482, expected end position: 4493
Jul 17 20:59:42 mx-test tika[25039]: WARN [qtp1401737458-25]
20:59:42,495 org.apache.tika.server.core.resource.TikaResource tika/: Text
extraction failed (Get_Started_With_Smallpdf.pdf)
Jul 17 20:59:42 mx-test tika[25039]:
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from
org.apache.tika.parser.pdf.PDFParser@4f3e230b
Jul 17 20:59:42 mx-test tika[25039]: at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:304)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:167)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at
org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:152)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at
org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:55)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at
org.apache.tika.server.core.resource.TikaResource.parse(TikaResource.java:352)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at
org.apache.tika.server.core.resource.TikaResource.lambda$produceText$1(TikaResource.java:502)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at
org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:177)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at
org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1616)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at
org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:249)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at
org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:122)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at
org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:84)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at
org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at
org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:90)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at
org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at
org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at
org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:265)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at
org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at
org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at
org.eclipse.jetty.server.Server.handle(Server.java:516)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at
org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at
org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at
org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at
org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at
org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at
org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at
java.lang.Thread.run(Thread.java:833) ~[?:?]
Jul 17 20:59:42 mx-test tika[25039]: Caused by:
java.io.IOException: Page tree root must be a dictionary
Jul 17 20:59:42 mx-test tika[25039]: at
org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:198)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at
org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:226)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at
org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1230)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at
org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1204)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at
org.apache.tika.parser.pdf.PDFParser.getPDDocument(PDFParser.java:291)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at
org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:178)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: ... 37 more
Jul 17 20:59:42 mx-test tika[25039]: ERROR [qtp1401737458-25]
20:59:42,499 org.apache.cxf.jaxrs.utils.JAXRSUtils Problem with writing the
data, class
org.apache.tika.server.core.resource.TikaResource$$Lambda$344/0x0000000800eb2e78,
ContentType: text/plain
where the attachment is:
pdfinfo Get_Started_With_Smallpdf.pdf
Creator: Adobe InDesign 15.1 (Macintosh)
Producer: Adobe PDF Library 15.0
CreationDate: Wed Oct 14 11:08:10 2020 EDT
ModDate: Wed Oct 14 11:08:10 2020 EDT
Custom Metadata: no
Metadata Stream: yes
Tagged: no
UserProperties: no
Suspects: no
Form: none
JavaScript: no
Pages: 1
Encrypted: no
Page size: 595.276 x 841.89 pts (A4)
Page rot: 0
File size: 69451 bytes
Optimized: no
PDF version: 1.7
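As a cross-check of the numbers above, here is a small Python sketch (not part of Tika or PDFBox; the regex and the helper names `parse_stream_warnings` and `beyond_file_size` are my own, hypothetical choices). It pulls the stream offsets out of the COSParser warnings and compares them with the file size that pdfinfo reports:

```python
import re

# COSParser warning payloads copied from the journal output above
WARNINGS = """\
stream start position: 104319, length: 366, expected end position: 104685
stream start position: 101704, length: 1475, expected end position: 103179
stream start position: 101514, length: 66, expected end position: 101580
stream start position: 2011, length: 2482, expected end position: 4493
"""

PDFINFO_FILE_SIZE = 69451  # "File size" line from the pdfinfo output

PATTERN = re.compile(
    r"stream start position: (\d+), length: (\d+), expected end position: (\d+)"
)


def parse_stream_warnings(text):
    """Return (start, length, expected_end) tuples found in the log text."""
    return [tuple(int(g) for g in m.groups()) for m in PATTERN.finditer(text)]


def beyond_file_size(entries, file_size):
    """Return the entries whose expected end lies past the given file size."""
    return [e for e in entries if e[2] > file_size]


entries = parse_stream_warnings(WARNINGS)
for start, length, end in entries:
    # start + length should equal the expected end position that is logged
    assert start + length == end

outside = beyond_file_size(entries, PDFINFO_FILE_SIZE)
print(f"{len(outside)} of {len(entries)} streams end past {PDFINFO_FILE_SIZE} bytes")
# prints "3 of 4 streams end past 69451 bytes"
```

Three of the four logged offsets lie beyond the 69451 bytes that pdfinfo sees. One possible reading, and only a hypothesis, is that the bytes reaching Tika are not identical to the file on disk; the filename/length DEBUG output should confirm or rule that out.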
I don't see any additional DEBUG info, or the targeted file length.
Are additional steps/config needed to enable the DEBUG output from the snapshot?
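
To double-check that no DEBUG lines are hiding among the stack traces, here is a minimal Python sketch that counts log levels in journal lines. The regex and the name `count_levels` are my own assumptions, based only on the line prefix visible in the output above (`... tika[PID]: LEVEL ...`). As far as I know, `-Dlog4j2.debug` only turns on Log4j's internal StatusLogger output (the `TRACE StatusLogger` line above) rather than DEBUG for application loggers, so a count of zero would be consistent with that.

```python
import re
from collections import Counter

# Matches the level that follows the "tika[PID]: " prefix in the journal lines
LEVEL = re.compile(r"tika\[\d+\]: (TRACE|DEBUG|INFO|WARN|ERROR)\b")


def count_levels(lines):
    """Count how many lines start a log record at each level."""
    counts = Counter()
    for line in lines:
        match = LEVEL.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# First lines of a few records, copied from the output above
SAMPLE = [
    "Jul 17 20:52:15 mx-test tika[25039]: TRACE StatusLogger",
    "Jul 17 20:59:42 mx-test tika[25039]: INFO [qtp1401737458-25]",
    "Jul 17 20:59:42 mx-test tika[25039]: WARN [qtp1401737458-25]",
    "Jul 17 20:59:42 mx-test tika[25039]: ERROR [qtp1401737458-25]",
    "Jul 17 20:59:42 mx-test tika[25039]: at",  # stack frame, no level
]

counts = count_levels(SAMPLE)
for level in ("TRACE", "DEBUG", "INFO", "WARN", "ERROR"):
    print(level, counts[level])
```

The same function can be fed the full journal by passing it `sys.stdin` with the output of `journalctl -u tika --no-pager` piped in.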