That's what I also get.

The next thing that could be done is to debug this, if possible. Tim suggested the file might be truncated.

I don't know if it is possible, but if you can run Tika in a debugger, stop at org.apache.pdfbox.pdfparser.PDFParser.initialParse(), where the exception "Page tree root must be a dictionary" is thrown. There, try to access this.fileLen and compare that number to your file length.

(I'm wondering whether we offer some debug info in the Tika server, or whether we could offer it in the future, e.g. reporting the length and/or an MD5 checksum when debug logging is on.)
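
For the local side of such a comparison, here is a minimal sketch in Python. It is not part of Tika or PDFBox, and the name file_fingerprint is made up for illustration. It prints the byte length and MD5 of the original file, so the length can be checked against whatever the parser reports:

```python
import hashlib
import sys


def file_fingerprint(path, chunk_size=65536):
    """Return (length_in_bytes, md5_hex) for the file at `path`.

    Hypothetical helper for illustration only; reads the file in chunks
    so large attachments do not have to fit in memory.
    """
    md5 = hashlib.md5()
    length = 0
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            md5.update(chunk)
            length += len(chunk)
    return length, md5.hexdigest()


if __name__ == "__main__":
    # Example invocation (script name is an assumption):
    #   python fingerprint.py GetStartedWithSmallpdf.pdf
    for name in sys.argv[1:]:
        size, digest = file_fingerprint(name)
        print(f"{name}: {size} bytes, md5={digest}")
```

If the length seen on the parser side is smaller than the one printed here, that would point to truncation somewhere between the source file and the parser.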

An alternative would be that 1) I add the file length to the PDFBox exception and 2) you create a Tika build with the PDFBox snapshot.

Tilman

On 15.07.2022 at 18:26, PGNet Dev wrote:
On 7/15/22 12:01 PM, Tim Allison wrote:
If you curl the test file (GetStartedWithSmallpdf.pdf) against your tika-server, what do you see?  The test file works for me with 2.4.2-SNAPSHOT at least.  Are the files getting truncated somehow?


If you curl the test file (GetStartedWithSmallpdf.pdf) against your tika-server, what do you see?

in journal log, only this:

    Jul 15 12:24:47 mx.loc tika[1143]: INFO  [qtp1837533591-23] 12:24:47,978 org.apache.tika.server.core.resource.TikaResource /tika (application/pdf)

and, @ console, this:

    https://pastebin.com/raw/Nu1RCbat



Are the files getting truncated somehow?

Perhaps?  Since curl of the source file against tika, as above, works ok, I'd guess that what's feeding tika -- namely dovecot's fts plugin -- would be a likely candidate.

