On 7/15/22 10:43 PM, Tilman Hausherr wrote:
That's what I also get.

The next thing that could be done is to debug this, if possible. Tim suggested
the file might be truncated.

I don't know if it is possible, but if you can run tika in a debugger, then
stop at org.apache.pdfbox.pdfparser.PDFParser.initialParse() where the
exception "Page tree root must be a dictionary" happens. There, try to access
this.fileLen. Compare that number to your file length.
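
The truncation theory can also be checked outside the debugger. The following is a minimal sketch, not part of PDFBox or Tika: it prints the on-disk file length (the number to compare against this.fileLen) and reports whether the "%%EOF" marker that normally ends a PDF appears near the end of the file. The class name PdfTailCheck and the 1024-byte search window are illustrative assumptions. A missing marker hints at truncation; a present marker does not prove the file is intact.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for this thread; not a PDFBox or Tika class.
class PdfTailCheck {
    // How many trailing bytes to search. Arbitrary choice for this sketch.
    static final int TAIL_BYTES = 1024;

    /** Returns true if the given bytes contain the PDF end-of-file marker. */
    public static boolean hasEofMarker(byte[] tail) {
        // ISO-8859-1 maps every byte to one char, so binary data is safe here.
        String text = new String(tail, StandardCharsets.ISO_8859_1);
        return text.contains("%%EOF");
    }

    /** Reads up to TAIL_BYTES bytes from the end of the file. */
    public static byte[] readTail(String path) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            long len = raf.length();
            int n = (int) Math.min(TAIL_BYTES, len);
            byte[] tail = new byte[n];
            raf.seek(len - n);
            raf.readFully(tail);
            return tail;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PdfTailCheck.java <file.pdf>");
            return;
        }
        try (RandomAccessFile raf = new RandomAccessFile(args[0], "r")) {
            // Compare this value with this.fileLen seen in the debugger.
            System.out.println("file length: " + raf.length());
        }
        System.out.println("%%EOF near end: " + hasEofMarker(readTail(args[0])));
    }
}
```

Run it against the attachment as saved to disk, e.g. `java PdfTailCheck.java attachment.pdf` (the file name is a placeholder).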

first stab at debugging this: i launch tika with the JDWP debug agent enabled,

        /usr/bin/java \
         -agentlib:jdwp=transport=dt_socket,address=127.0.0.1:8080,server=y,suspend=n \
         -jar /srv/tika/tika-server.jar \
         -c /etc/tika/tika-server-config-custom.xml

in another shell, attach the debugger

        jdb -attach 127.0.0.1:8080

then set the bp

        > stop in org.apache.pdfbox.pdfparser.PDFParser.initialParse
                Deferring breakpoint org.apache.pdfbox.pdfparser.PDFParser.initialParse.
                It will be set after the class is loaded.

i then send/receive the email with PDF attachment -- through dovecot>tika -- as 
above

i again see the scan-fail error in tika logs, but never see a

        Breakpoint hit: ...

dumping at prompt anyway,

        > dump this.fileLen
                No current thread
                 this.fileLen = null
        > threads
                Group system:
                  (java.lang.ref.Reference$ReferenceHandler)2788 Reference Handler   running
                  (java.lang.ref.Finalizer$FinalizerThread)2789  Finalizer           cond. waiting
                  (java.lang.Thread)2790                         Signal Dispatcher   running
                  (java.lang.Thread)2791                         Notification Thread running
                  (java.lang.Thread)2792                         process reaper      running
                Group main:
                  (java.lang.Thread)1                            main                cond. waiting
                  (java.lang.Thread)2780                         pool-2-thread-1     cond. waiting
                  (java.lang.Thread)2795                         Thread-2            running
                Group InnocuousThreadGroup:
                  (jdk.internal.misc.InnocuousThread)2796        Common-Cleaner      cond. waiting

am i even setting the breakpoint correctly, in order to catch the failure?

An alternative would be that 1) I add the file length in PDFBox exception 2) 
you create a Tika build with the PDFBox snapshot.

atm, i'm not building tika-server myself; rather, i'm just using the
downloaded runnable jar from

        https://dlcdn.apache.org/tika/2.4.1/tika-server-standard-2.4.1.jar

