Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?

Tilman Hausherr Fri, 15 Jul 2022 19:44:34 -0700

That's what I also get.

The next thing that could be done is to debug this, if possible. Tim suggested the file might be truncated.

I don't know if it is possible, but if you can run Tika in a debugger, then stop at org.apache.pdfbox.pdfparser.PDFParser.initialParse(), where the exception "Page tree root must be a dictionary" happens. There, try to access this.fileLen. Compare that number to your file length.

(I'm wondering if we are offering some debug info in the tika server, or if we could offer it in the future, e.g. telling the length, and/or offering an MD5 checksum if log debug mode is on)
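
Until something like that exists, the local side of the comparison can be done by hand. Below is a minimal Python sketch, not anything Tika or PDFBox provides: describe_file is a hypothetical helper name, and the %%EOF check only assumes that a complete PDF normally carries its end-of-file marker within the last 1024 bytes. It reports the byte length (to compare with this.fileLen), an MD5 checksum, and whether that marker was found.

```python
import hashlib
import os


def describe_file(path):
    """Return (length_in_bytes, md5_hex, has_eof_marker) for a local file.

    Hypothetical helper for comparing the file on disk with what the
    parser reports; it is not part of Tika or PDFBox.
    """
    length = os.path.getsize(path)
    md5 = hashlib.md5()
    tail = b""
    with open(path, "rb") as fh:
        # Read in chunks so large attachments do not have to fit in memory.
        for chunk in iter(lambda: fh.read(65536), b""):
            md5.update(chunk)
            # Keep only the last 1024 bytes seen so far.
            tail = (tail + chunk)[-1024:]
    # A truncated PDF usually lacks the trailing %%EOF marker.
    return length, md5.hexdigest(), b"%%EOF" in tail


# Example call (the path is whatever file you sent to the server):
# print(describe_file("GetStartedWithSmallpdf.pdf"))
```

If the length printed here is larger than the fileLen seen in the debugger, the stream was most likely cut short before it reached the parser.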

An alternative would be that 1) I add the file length to the PDFBox exception, and 2) you create a Tika build with the PDFBox snapshot.


Tilman

On 15.07.2022 at 18:26, PGNet Dev wrote:

On 7/15/22 12:01 PM, Tim Allison wrote:
If you curl the test file (GetStartedWithSmallpdf.pdf) against your tika-server, what do you see? The test file works for me with 2.4.2-SNAPSHOT at least. Are the files getting truncated somehow?
If you curl the test file (GetStartedWithSmallpdf.pdf) against your tika-server, what do you see?
in journal log, only this:
Jul 15 12:24:47 mx.loc tika[1143]: INFO [qtp1837533591-23] 12:24:47,978 org.apache.tika.server.core.resource.TikaResource /tika (application/pdf)
and, @ console, this:

    https://pastebin.com/raw/Nu1RCbat
Are the files getting truncated somehow?
Perhaps? I'd guess that, since curl of the source file against tika, as above, works ok, what's feeding tika -- namely dovecot's fts plugin -- would be a likely candidate.
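
For reference, the curl test can be approximated in Python, which makes it easy to print the number of bytes actually sent. This is only a sketch under stated assumptions: tika-server is taken to listen on its default http://localhost:9998, the /tika path and application/pdf content type are taken from the log line above, and build_tika_request and extract_text are hypothetical helper names.

```python
import urllib.request


def build_tika_request(pdf_bytes, base_url="http://localhost:9998"):
    """Build (but do not send) a PUT request for tika-server's /tika endpoint.

    base_url is an assumption; adjust it to where your server listens.
    """
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/tika",
        data=pdf_bytes,
        method="PUT",
        headers={
            "Content-Type": "application/pdf",
            "Accept": "text/plain",
        },
    )


def extract_text(path, base_url="http://localhost:9998"):
    """Send a local PDF to tika-server and return the extracted text."""
    with open(path, "rb") as fh:
        payload = fh.read()
    # Print the size that is sent, to compare with what the parser sees.
    print("sending", len(payload), "bytes")
    with urllib.request.urlopen(build_tika_request(payload, base_url)) as resp:
        return resp.read().decode("utf-8", errors="replace")


# Example call against a running server:
# print(extract_text("GetStartedWithSmallpdf.pdf"))
```

If this succeeds for a file that fails when it arrives via the fts plugin, the difference in byte counts between the two paths is the first thing to look at.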
