That's what I also get.
The next thing that could be done is to debug this, if possible. Tim suggested
the file might be truncated.
I don't know if it is possible, but if you can run Tika in a debugger,
stop at org.apache.pdfbox.pdfparser.PDFParser.initialParse(), where the
exception "Page tree root must be a dictionary" is thrown. There, try to
inspect this.fileLen and compare that number to your file length.
(I'm wondering if we are offering some debug info in the tika server, or
if we could offer it in the future, e.g. telling the length, and/or
offering an MD5 checksum if log debug mode is on)
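For the comparison on your side, here is a minimal sketch that prints the
length and MD5 of a local file, so the numbers can be checked against
this.fileLen or against whatever the server actually receives. The class
name FileFingerprint and its methods are my own illustration, not part of
Tika or PDFBox; it uses only the standard JDK.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper (not a Tika or PDFBox class): prints size and MD5 of a file.
public class FileFingerprint {

    // Hex-encoded MD5 of an in-memory byte array.
    static String md5Hex(byte[] data) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        return toHex(md.digest(data));
    }

    // Hex-encoded MD5 of a file, read in chunks so large files are not loaded at once.
    static String md5Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buf = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        return toHex(md.digest());
    }

    // Lowercase hex encoding, two characters per byte.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java FileFingerprint <file>");
            return;
        }
        Path file = Paths.get(args[0]);
        System.out.println("length=" + Files.size(file));
        System.out.println("md5=" + md5Hex(file));
    }
}
```

Run it against the original PDF on disk. If the length differs from what
PDFBox sees, the file is being altered or cut short on the way to the
server.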
An alternative would be that 1) I add the file length to the PDFBox
exception message, and 2) you create a Tika build with the PDFBox snapshot.
Tilman
On 15.07.2022 at 18:26, PGNet Dev wrote:
On 7/15/22 12:01 PM, Tim Allison wrote:
If you curl the test file (GetStartedWithSmallpdf.pdf) against your
tika-server, what do you see? The test file works for me with
2.4.2-SNAPSHOT at least. Are the files getting truncated somehow?
If you curl the test file (GetStartedWithSmallpdf.pdf) against your
tika-server, what do you see?
in journal log, only this:
Jul 15 12:24:47 mx.loc tika[1143]: INFO [qtp1837533591-23]
12:24:47,978 org.apache.tika.server.core.resource.TikaResource /tika
(application/pdf)
and, @ console, this:
https://pastebin.com/raw/Nu1RCbat
Are the files getting truncated somehow?
Perhaps? I'd guess that, since curl of the source file against tika,
as above, works ok, whatever is feeding tika -- namely dovecot's fts
plugin -- would be a likely candidate.