On 7/18/22 11:05 PM, Tilman Hausherr wrote:
Yes the file is deleted...


Alternatively, grab the source code from the trunk, and add this line in the 
file
tika-main\tika-parsers\tika-parsers-standard\tika-parsers-standard-modules\tika-parser-pdf-module\src\main\java\org\apache\tika\parser\pdf\PDFParser.java

Files.write(Paths.get("/tmp/yourfile.pdf"), Files.readAllBytes(tstream.getPath()));

after the line that has ", md5: ".

Then build the parser module, and then the standard server subproject with
"mvn -DskipTests install".
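
The one-line patch above can be tried in isolation before rebuilding Tika. What follows is only a minimal sketch: TempFileSnapshot and snapshot() are hypothetical names, not part of Tika, and the paths in main() are throwaway temp files standing in for tstream.getPath() and /tmp/yourfile.pdf. It uses Files.copy rather than the readAllBytes/write pair, which avoids holding the whole file in memory but is otherwise the same idea.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for illustration only; not part of Tika.
public class TempFileSnapshot {

    // Copy 'source' to 'target', overwriting any earlier snapshot.
    public static Path snapshot(Path source, Path target) throws IOException {
        return Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the parser's temp file (tstream.getPath() in the patch).
        Path source = Files.createTempFile("apache-tika-", ".tmp");
        Files.write(source, "placeholder bytes".getBytes(StandardCharsets.UTF_8));

        // Stand-in for /tmp/yourfile.pdf.
        Path target = Files.createTempFile("snapshot-", ".pdf");

        snapshot(source, target);

        // Simulate the temp file being deleted after parsing; the copy remains.
        Files.delete(source);
        System.out.println(target + " exists: " + Files.exists(target)
                + ", size: " + Files.size(target));
    }
}
```

The copy has to happen while the temp file still exists, which is why the patch places it right next to the debug line that reads the same path.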

First, attempting the build fails:

        cd src/tika
        EDIT tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java

                ...
                // line 168
                if (LOG.isDebugEnabled() && tstream != null) {
                    LOG.debug("File: " + tstream.getPath() + ", length: " + tstream.getLength() +
                            ", md5: " + calcMD5(tstream.getPath()));
        +           Files.write(Paths.get("/tmp/yourfile.pdf"), Files.readAllBytes(tstream.getPath()));
                }
                ...


        mvn install -pl tika-parsers -am
        mvn -DskipTests install
                ...
                [INFO] BUILD FAILURE
                [INFO] ------------------------------------------------------------------------
                [INFO] Total time:  31.493 s
                [INFO] Finished at: 2022-07-19T04:48:43-04:00
                [INFO] ------------------------------------------------------------------------
                [ERROR] Failed to execute goal org.apache.maven.plugins:maven-checkstyle-plugin:3.1.2:check (validate) on project tika-parser-pdf-module: You have 1 Checkstyle violation. -> [Help 1]


try setting a breakpoint in org.apache.tika.parser.pdf.PDFParser so that you 
get that file.

next, run in debugger instead,

        sudo -u tika /usr/bin/jdb \
         -classpath /srv/tika/tika-server.jar \
         org.apache.tika.server.core.TikaServerCli \
         -c /etc/tika/tika-server-config-custom.xml

                Initializing jdb ...

set breakpoint

        > stop in org.apache.tika.parser.pdf.PDFParser
        Deferring breakpoint org.apache.tika.parser.pdf.PDFParser.
        It will be set after the class is loaded.

run it

        > run
        run org.apache.tika.server.core.TikaServerCli -c /etc/tika/tika-server-config-custom.xml
        Set uncaught java.lang.Throwable
        Set deferred uncaught java.lang.Throwable
        >
        VM Started: DEBUG [pool-2-thread-1] 05:21:37,469 org.apache.tika.server.core.TikaServerWatchDog forked process commandline: [/usr/bin/java, -Xms1g, -Xmx1g, -Dpdfbox.fontcache=/var/tika, -Dlog4j2.debug, -Djava.awt.headless=true, -cp, /srv/tika/tika-server.jar, -Dtika.server.id=, org.apache.tika.server.core.TikaServerProcess, -h, 127.0.0.1, -p, 9998, -i, , -c, /etc/tika/tika-server-config-custom.xml, -forkedStatusFile, /tmp/apache-tika-server-forked-tmp-11335114907490900739, -numRestarts, 0]
        ...
        DEBUG [main] 05:21:50,871 org.apache.cxf.endpoint.ServerImpl register the server to serverRegistry
        TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.tika.server.core.ServerStatusWatcher
        INFO  [main] 05:21:50,906 org.apache.tika.server.core.TikaServerProcess Started Apache Tika server  at http://127.0.0.1:9998/

receive email+attachment

*lots* of debug logs @ jdb console,

        -> https://pastebin.com/HDtR9RKP

NOTE, there,

        ...
        DEBUG [qtp485047320-31] 05:22:58,423 org.apache.tika.parser.pdf.PDFParser File: /tmp/apache-tika-11251774738482156793.tmp, length: 104932, md5: 092bf24b2cac33fac27965549c99613a
        ...

but, no file captured

        ls -al /tmp/apache-tika*tmp
                ls: cannot access '/tmp/apache-tika*tmp': No such file or directory
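
If a copy does get captured on a later attempt, its checksum can be compared with the md5 value in the PDFParser debug line above to confirm it is the same file. A sketch only: it assumes the logged value is the lowercase hex MD5 of the file contents (Tika's calcMD5 is not shown in this thread, so that is an assumption), and Md5Check / md5Hex are hypothetical names.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration only; not part of Tika.
public class Md5Check {

    // Return the MD5 of the file's bytes as a lowercase hex string.
    public static String md5Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(Files.readAllBytes(file));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to the captured copy; args[1]: md5 value copied from the log.
        String actual = md5Hex(Paths.get(args[0]));
        System.out.println(actual + (actual.equals(args[1]) ? " (matches log)" : " (differs from log)"));
    }
}
```

It would be run with the captured file's path and the logged md5 as the two arguments.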

is there anything informative in that now-more-verbose DEBUG output?


